From: Chris Li <chrisl@kernel.org>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Kairui Song <kasong@tencent.com>,
Hugh Dickins <hughd@google.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Kalesh Singh <kaleshsingh@google.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Barry Song <baohua@kernel.org>
Subject: Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
Date: Mon, 26 Aug 2024 14:26:19 -0700 [thread overview]
Message-ID: <CACePvbUp1-BsWgYX0hWDVYT+8Q2w_E-0z5up==af_B5KJ7q=VA@mail.gmail.com> (raw)
In-Reply-To: <871q2lhr4s.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@intel.com> wrote:
> > BTW, what is your take on my previous analysis of the current SSD
> > prefer write new cluster can wear out the SSD faster?
>
> No. I don't agree with you on that. However, my knowledge on SSD
> wearing out algorithm is quite limited.
Hi Ying,
Can you please clarify? You said your knowledge of SSD wearing internals is
limited. Does that mean you have low confidence in your verdict?
I would like to understand your reasoning for the disagreement, starting
with which part of my analysis you disagree with.
At the same time, we can consult someone who works in the SSD space to
understand SSD internal wearing better.
I see this as a serious issue for using SSDs as swap in data center use
cases. In your laptop use case, you are not running LLM training 24/7,
right? So it still fits the usage model of an occasional swap user, and it
might not be as big a deal. In data center workloads, e.g. Google's, swap
is written 24/7, and the amount of data swapped out is much higher than in
typical laptop usage as well. There the SSD wear-out issue is much worse,
because the SSD is under constant write with much larger swap usage.
I am claiming that *some* SSDs would have a higher internal write
amplification factor when doing random 4K writes all over the drive than
when doing random 4K writes to a small area of the drive.
I do believe a different swap-out policy that controls preferring old vs
new clusters would be beneficial for the data center SSD swap use case.
It comes down to:
1) SSDs are slow to erase, so for performance most SSDs erase at a huge
erase-block granularity.
2) The SSD remaps logical block addresses to internal erase blocks. Newly
written data, regardless of its logical block address on the drive, is
grouped together and written into an erase block.
3) When new data overwrites an old logical block address, the SSD firmware
marks the over-written data as obsolete. The discard command has a similar
effect without introducing new data.
4) When the SSD runs out of fresh erase blocks, it needs to GC the old,
fragmented erase blocks, partially rewriting the still-valid old data to
make room for new erase blocks. This is where the discard command can be
beneficial: it tells the SSD firmware which parts of the old data the GC
process can simply ignore and skip rewriting.
GC of obsolete logical blocks is a generally hard problem for SSDs.
I am not claiming every SSD has this kind of behavior, but it is common
enough to be worth providing an option. A toy model below illustrates the
effect.
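
To make the write amplification claim concrete, here is a toy user-space
model I sketched. It is purely illustrative and not taken from any real
firmware: greedy GC, one open block, ~10% over-provisioning, 2M erase
blocks of 4K pages, all parameters made up. It is only meant to show the
direction of the effect.

/*
 * Toy FTL write-amplification model -- purely illustrative, not taken
 * from any real firmware: greedy GC, one open block, ~10%
 * over-provisioning, 2M erase blocks of 4K pages.
 * Build: gcc -O2 ftl_wa.c -o ftl_wa
 */
#include <stdio.h>
#include <stdlib.h>

#define BLOCKS   1024
#define PAGES    512                      /* 4K pages per 2M erase block */
#define PHYS     (BLOCKS * PAGES)
#define LOGICAL  (PHYS / 10 * 9)          /* ~10% spare space */

static long l2p[LOGICAL], p2l[PHYS];      /* logical<->physical page maps */
static int  valid[BLOCKS], freeb[BLOCKS], nfree, cur = -1, cur_page;
static char blk_free[BLOCKS];
static long host_w, gc_w;

static void append(long lpn)              /* place lpn in the open block */
{
        if (cur < 0 || cur_page == PAGES) {
                cur = freeb[--nfree];
                blk_free[cur] = 0;
                cur_page = 0;
        }
        long ppn = (long)cur * PAGES + cur_page++;
        l2p[lpn] = ppn;
        p2l[ppn] = lpn;
        valid[cur]++;
}

static void gc_one(void)                  /* reclaim the emptiest block */
{
        int b, victim = -1;

        for (b = 0; b < BLOCKS; b++)
                if (!blk_free[b] && b != cur &&
                    (victim < 0 || valid[b] < valid[victim]))
                        victim = b;
        for (long p = (long)victim * PAGES; p < (long)(victim + 1) * PAGES; p++)
                if (p2l[p] >= 0) {        /* still-valid data must be rewritten */
                        long lpn = p2l[p];
                        p2l[p] = -1;
                        valid[victim]--;
                        append(lpn);
                        gc_w++;
                }
        blk_free[victim] = 1;             /* "erase" */
        freeb[nfree++] = victim;
}

static void host_write(long lpn)
{
        long old = l2p[lpn];

        if (old >= 0) {                   /* overwrite: old copy becomes obsolete */
                p2l[old] = -1;
                valid[old / PAGES]--;
        }
        append(lpn);
        host_w++;
        while (nfree < 2)
                gc_one();
}

static double run(double span, long nwrites)
{
        long i, hot = (long)(LOGICAL * span);

        nfree = BLOCKS;                   /* reset drive state */
        cur = -1;
        for (i = 0; i < BLOCKS; i++) {
                blk_free[i] = 1;
                freeb[i] = i;
                valid[i] = 0;
        }
        for (i = 0; i < PHYS; i++)
                p2l[i] = -1;
        for (i = 0; i < LOGICAL; i++)
                l2p[i] = -1;
        for (i = 0; i < LOGICAL; i++)     /* fill the drive once */
                host_write(i);
        host_w = gc_w = 0;                /* measure the overwrite phase only */
        for (i = 0; i < nwrites; i++)
                host_write(random() % hot);
        return (double)(host_w + gc_w) / host_w;
}

int main(void)
{
        printf("WA, random 4K over the whole drive: %.2f\n", run(1.0, 20000000));
        printf("WA, random 4K over 1/16 of it     : %.2f\n", run(1.0 / 16, 20000000));
        return 0;
}

The point is just that when the random writes are confined to a small
region, the GC victims come from that region and contain mostly obsolete
pages, so little valid data has to be rewritten. When the same random 4K
writes are spread over the whole drive, every erase block ends up partially
valid and GC has to rewrite a lot of old data.
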
> > I think it might be useful to provide users an option to choose to
> > write a non full list first. The trade off is more friendly to SSD
> > wear out than preferring to write new blocks. If you keep doing the
> > swap long enough, there will be no new free cluster anyway.
>
> It depends on workloads. Some workloads may demonstrate better spatial
> locality.
Yes, I agree that it may or may not happen, depending on the workload. The
randomly distributed swap entry pattern is common enough that we need to
consider it as well, and the odds are against us. As in the quoted email
where I did the calculation, the odds of getting a whole cluster free in
the random model are very low, about 4.4E-15, even if we are only using
1/16 of the swap entries in the swapfile.
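
For reference, assuming 512 4K slots per 2M cluster and slots occupied
independently at 1/16 utilization, that probability is simply (15/16)^512,
which lands in the same ballpark:

/* Quick sanity check of the number above; assumes 512 4K slots per 2M
 * cluster and independently occupied slots.
 * Build: gcc prob.c -o prob -lm
 */
#include <stdio.h>
#include <math.h>

int main(void)
{
        /* each of the 512 slots is free with probability 15/16 */
        printf("%.2e\n", pow(15.0 / 16.0, 512));   /* ~4.5e-15 */
        return 0;
}
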
Chris
>
> > The example I give in this email:
> >
> > https://lore.kernel.org/linux-mm/CACePvbXGBNC9WzzL4s2uB2UciOkV6nb4bKKkc5TBZP6QuHS_aQ@mail.gmail.com/
> >
> > Chris
> >>
> >> > /*
> >> > @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> > ci = lock_cluster(si, offset);
> >> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >> > ci->count = 0;
> >> > + ci->order = 0;
> >> > ci->flags = 0;
> >> > free_cluster(si, ci);
> >> > unlock_cluster(ci);
> >> > @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> > INIT_LIST_HEAD(&p->free_clusters);
> >> > INIT_LIST_HEAD(&p->discard_clusters);
> >> >
> >> > + for (i = 0; i < SWAP_NR_ORDERS; i++)
> >> > + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> >> > +
> >> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >> > unsigned int page_nr = swap_header->info.badpages[i];
> >> > if (page_nr == 0 || page_nr > swap_header->info.last_page)
>
> --
> Best Regards,
> Huang, Ying