From: Daniel Gomez <da.gomez@samsung.com>
To: David Hildenbrand <david@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Barry Song <v-songbaohua@oppo.com>,
Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>, Luis Chamberlain <mcgrof@kernel.org>,
Pankaj Raghav <p.raghav@samsung.com>
Subject: Swap Min Order
Date: Tue, 7 Jan 2025 10:43:47 +0100
Message-ID: <20250107094347.l37isnk3w2nmpx2i@AALNPWDAGOMEZ1.aal.scsc.local>
In-Reply-To: <CGME20250107094349eucas1p1c973738624046458bbd8ca980cf6fe33@eucas1p1.samsung.com>
Hi,
High-capacity SSDs require writes to be aligned to the drive's
indirection unit (IU), which is typically larger than 4 KiB, to avoid
read-modify-write (RMW) cycles. To support swap well on these devices,
we need to ensure that writes do not cross IU boundaries, which I think
requires raising the minimum allocation size for swap users.
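To make this concrete, below is a minimal sketch (not a patch, just for
discussion) of how a per-device minimum swap order could be derived,
assuming the drive exposes its IU through the block layer's minimum I/O
size; swap_min_order() is a hypothetical helper name:

#include <linux/blkdev.h>
#include <linux/mm.h>

/* Hypothetical helper: smallest folio order whose size covers the IU. */
static unsigned int swap_min_order(struct block_device *bdev)
{
	/* Assumption: io_min reflects the drive's IU (e.g. 16 KiB). */
	unsigned int io_min = bdev_io_min(bdev);

	if (io_min <= PAGE_SIZE)
		return 0;		/* 4 KiB writes are already fine */

	return get_order(io_min);	/* e.g. 16 KiB -> order 2 */
}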
As a temporary alternative, a proposal [1] to prevent swap on these
devices altogether was sent for discussion before LBS was merged in
v6.12 [2]. Additional details and reasoning can be found in that thread.
[1] https://lore.kernel.org/all/20240627000924.2074949-1-mcgrof@kernel.org/
[2] https://lore.kernel.org/all/20240913-vfs-blocksize-ab40822b2366@brauner/
So, I'd like to bring this up for discussion here, and/or propose it as
a topic for the next MM bi-weekly meeting if needed. Please let me know
if this has already been discussed. Given that we already support large
folios with mTHP for anonymous memory and shmem, a similar approach that
avoids falling back to smaller allocations might suffice, as is done in
the page cache with min order.
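For reference, this is roughly the shape of the analogy.
mapping_set_folio_min_order() is the existing pagemap.h helper used by
the LBS/min-order work to record the floor per address_space (the call
site below is schematic), while swap_alloc_folio_order() is purely
hypothetical and only illustrates "try large orders first, but never
fall back below the minimum":

/* Page cache side (existing helper; schematic call site, bs >= ps): */
mapping_set_folio_min_order(inode->i_mapping,
			    inode->i_blkbits - PAGE_SHIFT);

/*
 * Hypothetical swap side: try larger orders first, but never fall back
 * below the device-derived minimum order.
 */
static struct folio *swap_alloc_folio_order(unsigned int order,
					    unsigned int min_order, gfp_t gfp)
{
	struct folio *folio;

	while (order >= min_order) {
		folio = folio_alloc(gfp, order);
		if (folio)
			return folio;
		if (order == min_order)
			break;
		order--;
	}
	return NULL;	/* caller must not split below min_order */
}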
Monitoring writes on a dedicated NVMe drive with swap enabled, using the
blkalgn tool [3], I get the following results:
[3] https://github.com/iovisor/bcc/pull/5128
Swap setup:
mkdir -p /mnt/swap
sudo mkfs.xfs -b size=16k /dev/nvme0n1 -f
sudo mount --types xfs /dev/nvme0n1 /mnt/swap
sudo fallocate -l 8192M /mnt/swap/swapfile
sudo chmod 600 /mnt/swap/swapfile
sudo mkswap /mnt/swap/swapfile
sudo swapon /mnt/swap/swapfile
Swap stress test (guest with 7.8Gi of RAM):
stress --vm-bytes 7859M --vm-keep -m 1 --timeout 300
Results:
1. Vanilla v6.12, no mTHP enabled:
I/O Alignment Histogram for Device nvme0n1
bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 3255 |****************************************|
8192 -> 16383 : 783 |********* |
16384 -> 32767 : 255 |*** |
32768 -> 65535 : 61 | |
65536 -> 131071 : 24 | |
131072 -> 262143 : 22 | |
262144 -> 524287 : 2136 |************************** |
The above shows the alignment of writes, in power-of-2 buckets, for the
dedicated swap device nvme0n1. The corresponding write granularity (size)
for these alignments is shown in the linear histogram below, where the
unit is 512-byte sectors (e.g. sector 8 means 8 << 9 = 4096 bytes). So
the first entry indicates that 821 writes were sent with a size of 4
KiB, and the last one (sector 1024, i.e. 1024 << 9) shows that 2441
writes were sent with a size of 512 KiB.
I/O Granularity Histogram for Device nvme0n1
Total I/Os: 6536
sector : count distribution
8 : 821 |************* |
16 : 131 |** |
24 : 339 |***** |
32 : 259 |**** |
40 : 114 |* |
48 : 162 |** |
56 : 249 |**** |
64 : 257 |**** |
72 : 157 |** |
80 : 90 |* |
88 : 109 |* |
96 : 188 |*** |
104 : 228 |*** |
112 : 262 |**** |
120 : 81 |* |
128 : 44 | |
136 : 22 | |
144 : 20 | |
152 : 20 | |
160 : 18 | |
168 : 43 | |
176 : 9 | |
184 : 5 | |
192 : 2 | |
200 : 3 | |
208 : 2 | |
216 : 4 | |
224 : 6 | |
232 : 4 | |
240 : 2 | |
248 : 11 | |
256 : 9 | |
264 : 17 | |
272 : 19 | |
280 : 16 | |
288 : 7 | |
296 : 5 | |
304 : 2 | |
312 : 7 | |
320 : 5 | |
328 : 4 | |
336 : 23 | |
344 : 2 | |
352 : 12 | |
360 : 5 | |
368 : 5 | |
376 : 1 | |
384 : 3 | |
392 : 3 | |
400 : 2 | |
408 : 1 | |
416 : 1 | |
424 : 6 | |
432 : 5 | |
440 : 3 | |
448 : 7 | |
456 : 2 | |
472 : 2 | |
480 : 2 | |
488 : 7 | |
496 : 5 | |
504 : 11 | |
520 : 3 | |
528 : 1 | |
536 : 2 | |
544 : 5 | |
560 : 1 | |
568 : 2 | |
576 : 1 | |
584 : 2 | |
592 : 2 | |
600 : 2 | |
608 : 1 | |
616 : 2 | |
624 : 5 | |
632 : 1 | |
640 : 1 | |
648 : 1 | |
656 : 5 | |
664 : 8 | |
672 : 20 | |
680 : 3 | |
688 : 1 | |
704 : 1 | |
712 : 1 | |
720 : 3 | |
728 : 4 | |
736 : 6 | |
744 : 14 | |
752 : 14 | |
760 : 12 | |
768 : 3 | |
776 : 5 | |
784 : 2 | |
792 : 2 | |
800 : 1 | |
808 : 3 | |
816 : 1 | |
824 : 5 | |
832 : 2 | |
840 : 15 | |
848 : 9 | |
856 : 2 | |
864 : 1 | |
872 : 2 | |
880 : 10 | |
888 : 4 | |
896 : 5 | |
904 : 1 | |
920 : 2 | |
936 : 3 | |
944 : 1 | |
952 : 6 | |
960 : 1 | |
968 : 1 | |
976 : 1 | |
984 : 1 | |
992 : 2 | |
1000 : 2 | |
1008 : 16 | |
1016 : 1 | |
1024 : 2441 |****************************************|
2. Vanilla v6.12 with all mTHP enabled:
I/O Alignment Histogram for Device nvme0n1
bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 5076 |****************************************|
8192 -> 16383 : 907 |******* |
16384 -> 32767 : 302 |** |
32768 -> 65535 : 141 |* |
65536 -> 131071 : 46 | |
131072 -> 262143 : 35 | |
262144 -> 524287 : 1993 |*************** |
524288 -> 1048575 : 6 | |
In addition, I've tested and monitored writes with SWP_BLKDEV enabled
for regular files, to allow large folios for swapfiles backed by block
devices and check the difference:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..80a9dbe9645a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3128,6 +3128,7 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
si->flags |= SWP_BLKDEV;
} else if (S_ISREG(inode->i_mode)) {
si->bdev = inode->i_sb->s_bdev;
+ si->flags |= SWP_BLKDEV;
}
return 0;
With the following alignment results:
3. v6.12 + SWP_BLKDEV change with mTHP disabled:
I/O Alignment Histogram for Device nvme0n1
bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 146 |***** |
8192 -> 16383 : 23 | |
16384 -> 32767 : 10 | |
32768 -> 65535 : 1 | |
65536 -> 131071 : 3 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 1020 |****************************************|
4. v6.12 + SWP_BLKDEV change with mTHP enabled:
I/O Alignment Histogram for Device nvme0n1
bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 240 |****** |
8192 -> 16383 : 34 | |
16384 -> 32767 : 4 | |
32768 -> 65535 : 0 | |
65536 -> 131071 : 1 | |
131072 -> 262143 : 1 | |
262144 -> 524287 : 1542 |****************************************|
2nd run:
I/O Alignment Histogram for Device nvme0n1
bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 0 | |
4096 -> 8191 : 356 |************ |
8192 -> 16383 : 74 |** |
16384 -> 32767 : 58 |** |
32768 -> 65535 : 54 |* |
65536 -> 131071 : 37 |* |
131072 -> 262143 : 11 | |
262144 -> 524287 : 1104 |****************************************|
524288 -> 1048575 : 1 | |
For comparison, the histogram below is from a stress test on a drive
with LBS enabled (XFS with a 16k block size) using random-size writes:
I/O Alignment Histogram for Device nvme0n1
Bytes : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 1758 |* |
1024 -> 2047 : 476 | |
2048 -> 4095 : 164 | |
4096 -> 8191 : 42 | |
8192 -> 16383 : 10 | |
16384 -> 32767 : 3629 |*** |
32768 -> 65535 : 47861 |****************************************|
65536 -> 131071 : 25702 |********************* |
131072 -> 262143 : 10791 |********* |
262144 -> 524287 : 11094 |********* |
524288 -> 1048575 : 55 | |
The test drive here uses a 512-byte LBA format, so writes can start at
that boundary. However, LBS/min order keeps most of the writes aligned
to 16k boundaries or larger.
What do you think?
Daniel