From: "Huang, Ying" <ying.huang@intel.com>
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts, Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
In-Reply-To: (Chris Li's message of "Mon, 26 Aug 2024 14:26:19 -0700")
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> <20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org> <87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com> <871q2lhr4s.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Mon, 09 Sep 2024 15:19:11 +0800
Message-ID: <874j6p1ehc.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Chris Li writes:

> On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying wrote:
>> > BTW, what is your take on my previous analysis that the current SSD
>> > policy of preferring to write new clusters can wear out the SSD faster?
>>
>> No. I don't agree with you on that. However, my knowledge of SSD
>> wear-out algorithms is quite limited.
>
> Hi Ying,
>
> Can you please clarify? You said you have limited knowledge of SSD
> wear internals. Does that mean you have low confidence in your
> verdict?

Yes.

> I would like to understand your reasoning for the disagreement.
> Starting from which part of my analysis do you disagree?
>
> At the same time, we can consult someone who works in the SSD space
> and understands SSD internal wear better.

I think that is a good idea.

> I see this as a serious issue for using SSDs as swap in data center
> use cases. In your laptop use case, you are not running LLM training
> 24/7, right? So it still fits the usage model of an occasional user
> of the swap file, and it might not be as big a deal. In a data center
> workload, e.g. Google's, swap is written 24/7, and the amount of data
> swapped out is much higher than in typical laptop usage as well.
> There the SSD wear-out problem is much worse, because the SSD is
> under constant write with much larger swap usage.
>
> I am claiming that *some* SSDs have a higher internal write
> amplification factor when doing random 4K writes over the whole
> drive than when doing random 4K writes to a small area of the drive.
>
> I do believe that a swap-out policy controlling the preference for
> old vs. new clusters is beneficial to the data center SSD swap use
> case.
>
> It comes down to:
>
> 1) SSDs are slow to erase, so most SSDs erase at a large erase-block
> size.
>
> 2) The SSD remaps logical block addresses to internal erase blocks.
> Newly written data, regardless of its logical block address on the
> drive, is grouped together and written out to an erase block.
>
> 3) When new data overwrites an old logical address, the SSD firmware
> marks the overwritten data as obsolete. The discard command has a
> similar effect without introducing new data.
>
> 4) When the SSD runs out of fresh erase blocks, it needs to GC the
> old fragmented erase blocks, partially rewriting the old data to make
> room for new erase blocks. This is where the discard command can be
> beneficial: it tells the SSD firmware which parts of the old data the
> GC process can simply ignore and skip rewriting.
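[The four steps above can be sketched as a toy flash-translation-layer model. Everything here — the block counts, the greedy victim selection, the `simulate_wa` helper — is a hypothetical illustration of the argument, not any real firmware; it compares write amplification for random 4K overwrites spread over the whole drive vs. confined to a small hot region.]

```python
import random

def simulate_wa(logical_pages=2048, pages_per_block=64, spare_blocks=4,
                n_writes=20000, hot_fraction=1.0, seed=0):
    """Toy FTL with greedy GC: returns write amplification (physical page
    writes per host page write) for random overwrites that span
    `hot_fraction` of the logical space."""
    rng = random.Random(seed)
    n_blocks = logical_pages // pages_per_block + spare_blocks
    page_loc = {}                               # logical page -> erase block
    members = [set() for _ in range(n_blocks)]  # valid logical pages per block
    free_blocks = list(range(1, n_blocks))
    open_blk, fill = 0, 0
    phys = [0]                                  # physical page writes

    def place(lp):
        nonlocal open_blk, fill
        pending = [lp]
        while pending:
            p = pending.pop()
            if fill == pages_per_block:         # open erase block is full
                if not free_blocks:
                    # Greedy GC (step 4): erase the block with the fewest
                    # valid pages, relocating (rewriting) what is still valid.
                    victim = min(range(n_blocks),
                                 key=lambda b: len(members[b]))
                    for q in members[victim]:
                        del page_loc[q]
                    pending.extend(members[victim])
                    members[victim] = set()
                    free_blocks.append(victim)
                open_blk, fill = free_blocks.pop(), 0
            if p in page_loc:                   # overwrite marks the old
                members[page_loc[p]].discard(p) # copy obsolete (step 3)
            page_loc[p] = open_blk              # new data is grouped into the
            members[open_blk].add(p)            # current erase block (step 2)
            fill += 1
            phys[0] += 1

    for lp in range(logical_pages):             # pre-fill drive with cold data
        place(lp)
    phys[0] = 0                                 # count the overwrite phase only
    hot = max(1, int(logical_pages * hot_fraction))
    for _ in range(n_writes):
        place(rng.randrange(hot))
    return phys[0] / n_writes
```

Under this model, `simulate_wa(hot_fraction=1.0)` (random writes over the whole drive) yields noticeably higher amplification than `simulate_wa(hot_fraction=1/16)` (writes confined to a small region), because confined churn quickly produces fully obsolete erase blocks that GC can reclaim almost for free — which is the mechanism the argument above relies on.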
> GC of obsolete logical blocks is a generally hard problem for SSDs.
>
> I am not claiming every SSD behaves this way, but it is common enough
> to be worth providing an option.
>
>> > I think it might be useful to provide users an option to choose to
>> > write to the nonfull list first. The trade-off is being more
>> > friendly to SSD wear than preferring to write new blocks. If you
>> > keep swapping long enough, there will be no new free clusters
>> > anyway.
>>
>> It depends on workloads. Some workloads may demonstrate better
>> spatial locality.
>
> Yes, I agree that it may or may not happen depending on the workload.
> But a random distribution of swap entries is a common pattern we need
> to consider as well, and the odds are against us. As in the quoted
> email where I did the calculation, the odds of a whole cluster being
> free in the random model are very low, about 4.4e-15, even if only
> 1/16 of the swap entries in the swapfile are in use.

Do you have real workloads? For example, some trace?

--
Best Regards,
Huang, Ying
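[Chris's low-odds figure above can be checked with a one-liner, assuming the independence model his calculation implies and a 512-entry cluster — the PMD-order cluster size on x86-64 with 4 KB pages when THP swap is enabled; the exact SWAPFILE_CLUSTER value depends on the kernel configuration.]

```python
# Probability that all 512 entries of a cluster are simultaneously free,
# with each entry independently in use with probability 1/16:
p = (15 / 16) ** 512
print(f"{p:.1e}")  # on the order of 4e-15, matching the figure quoted above
```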