* [PATCH 0/2] mm: swap: mTHP swap allocator based on swap cluster order
From: Chris Li @ 2024-05-24 17:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm,
Chris Li, Barry Song
This is the short-term solution "swap cluster order" listed
in slide 8 of my "Swap Abstraction" discussion at the recent
LSF/MM conference.
Commit 845982eb264bc ("mm: swap: allow storage of all mTHP
orders") only allocates mTHP swap entries from the empty
cluster list. That works well for PMD-size THP, but it has a
serious fragmentation issue, reported by Barry:
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
The mTHP allocation failure rate rises to almost 100% after a few
hours in Barry's test run.
The reason is that all the empty clusters have been exhausted,
while there are plenty of free swap entries in the clusters that
are not 100% free.
Address this by remembering the swap allocation order in the cluster.
Keep track of a per-order non-full cluster list for later allocation.
This greatly improves the success rate of mTHP swap allocation.
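The sketch below only illustrates the idea; the added "order" field and
the "nonfull_clusters" name are illustrative assumptions, not necessarily
the exact code of this series:

	/* Each cluster remembers which allocation order it serves. */
	struct swap_cluster_info {
		spinlock_t lock;	/* protects count, flags, order */
		unsigned int count:16;	/* entries in use in this cluster */
		unsigned int flags:8;	/* e.g. CLUSTER_FLAG_FREE */
		unsigned int order:8;	/* order this cluster was allocated for */
		struct list_head next;	/* list node: free / nonfull / discard */
	};

	struct swap_info_struct {
		/* ... */
		struct list_head free_clusters;
		/* partially used clusters, one list per allocation order */
		struct list_head nonfull_clusters[SWAP_NR_ORDERS];
		/* ... */
	};

An mTHP allocation can then first try a non-full cluster of the requested
order, fall back to taking a cluster off the free list (stamping it with
that order), and only fail the large order after both are exhausted.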
While I am still waiting for Barry's test results, I am pasting
Kairui's test results here:
I'm able to reproduce such an issue with a simple script (enabling all orders of mTHP; see the note after the script):
modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024))
swapoff -a
mkswap /dev/ram0
swapon /dev/ram0
rmdir /sys/fs/cgroup/benchmark
mkdir -p /sys/fs/cgroup/benchmark
cd /sys/fs/cgroup/benchmark
echo 8G > memory.max
echo $$ > cgroup.procs
memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary &
/usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \
-P memcache_binary -n allkeys --key-minimum=1 \
--key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \
--ratio 1:0 --pipeline 8 -d 1024
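(The script above assumes all mTHP sizes are already enabled. On kernels
with the per-size mTHP sysfs controls that step would typically look like
the following; exact paths and available sizes depend on the kernel:)

	for f in /sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled; do
		echo always > "$f"
	done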
Before:
Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98
After:
Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74
And the fallback ratio dropped by a lot:
Before:
hugepages-32kB/stats/anon_swpout_fallback:15997
hugepages-32kB/stats/anon_swpout:18712
hugepages-512kB/stats/anon_swpout_fallback:192
hugepages-512kB/stats/anon_swpout:0
hugepages-2048kB/stats/anon_swpout_fallback:2
hugepages-2048kB/stats/anon_swpout:0
hugepages-1024kB/stats/anon_swpout_fallback:0
hugepages-1024kB/stats/anon_swpout:0
hugepages-64kB/stats/anon_swpout_fallback:18246
hugepages-64kB/stats/anon_swpout:17644
hugepages-16kB/stats/anon_swpout_fallback:13701
hugepages-16kB/stats/anon_swpout:18234
hugepages-256kB/stats/anon_swpout_fallback:8642
hugepages-256kB/stats/anon_swpout:93
hugepages-128kB/stats/anon_swpout_fallback:21497
hugepages-128kB/stats/anon_swpout:7596
(Still collecting more data; the successful swpouts mostly happened early on, then the fallbacks began to increase, approaching a 100% failure rate)
After:
hugepages-32kB/stats/swpout:34445
hugepages-32kB/stats/swpout_fallback:0
hugepages-512kB/stats/swpout:1
hugepages-512kB/stats/swpout_fallback:134
hugepages-2048kB/stats/swpout:1
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:6
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35495
hugepages-64kB/stats/swpout_fallback:0
hugepages-16kB/stats/swpout:32441
hugepages-16kB/stats/swpout_fallback:0
hugepages-256kB/stats/swpout:2223
hugepages-256kB/stats/swpout_fallback:6278
hugepages-128kB/stats/swpout:29136
hugepages-128kB/stats/swpout_fallback:52
Reported-by: Barry Song <21cnbao@gmail.com>
Tested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Chris Li (2):
mm: swap: swap cluster switch to double link list
mm: swap: mTHP allocate swap entries from nonfull list
include/linux/swap.h | 18 ++--
mm/swapfile.c | 252 +++++++++++++++++----------------------------------
2 files changed, 93 insertions(+), 177 deletions(-)
---
base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed
change-id: 20240523-swap-allocator-1534c480ece4
Best regards,
--
Chris Li <chrisl@kernel.org>
^ permalink raw reply [flat|nested] 43+ messages in thread* [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-24 17:17 [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li @ 2024-05-24 17:17 ` Chris Li 2024-05-28 16:23 ` Kairui Song 2024-05-29 8:46 ` Huang, Ying 2024-05-24 17:17 ` [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li ` (3 subsequent siblings) 4 siblings, 2 replies; 43+ messages in thread From: Chris Li @ 2024-05-24 17:17 UTC (permalink / raw) To: Andrew Morton Cc: Kairui Song, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm, Chris Li, Barry Song Previously, the swap cluster used a cluster index as a pointer to construct a custom single link list type "swap_cluster_list". The next cluster pointer is shared with the cluster->count. The assumption is that only the free cluster needs to be put on the list. That assumption is not true for mTHP allocators any more. Need to track the non full cluster on the list as well. Move the current cluster single link list into standard double link list. Remove the cluster getter/setter for accessing the cluster struct member. Move the cluster locking in the caller function rather than the getter/setter function. That way the locking can protect more than one member, e.g. cluster->flag. Change cluster code to use "struct swap_cluster_info *" to reference the cluster rather than by using index. That is more consistent with the list manipulation. It avoids the repeat adding index to the cluser_info. The code is easier to understand. Remove the cluster next pointer is NULL flag, the double link list can handle the empty list pretty well. The "swap_cluster_info" struct is two pointer bigger, because 512 swap entries share one swap struct, it has very little impact on the average memory usage per swap entry. Other than the list conversion, there is no real function change in this patch. --- include/linux/swap.h | 14 ++-- mm/swapfile.c | 231 ++++++++++++++------------------------------------- 2 files changed, 68 insertions(+), 177 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 11c53692f65f..0d3906eff3c9 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -254,11 +254,12 @@ struct swap_cluster_info { * elements correspond to the swap * cluster */ - unsigned int data:24; + unsigned int count:16; unsigned int flags:8; + struct list_head next; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ + /* * The first page in the swap file is the swap header, which is always marked @@ -283,11 +284,6 @@ struct percpu_cluster { unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; -struct swap_cluster_list { - struct swap_cluster_info head; - struct swap_cluster_info tail; -}; - /* * The in-memory structure used to track swap areas. */ @@ -300,7 +296,7 @@ struct swap_info_struct { unsigned int max; /* extent of the swap_map */ unsigned char *swap_map; /* vmalloc'ed array of usage counts */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ - struct swap_cluster_list free_clusters; /* free clusters list */ + struct list_head free_clusters; /* free clusters list */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ @@ -333,7 +329,7 @@ struct swap_info_struct { * list. 
*/ struct work_struct discard_work; /* discard worker */ - struct swap_cluster_list discard_clusters; /* discard clusters list */ + struct list_head discard_clusters; /* discard clusters list */ struct plist_node avail_lists[]; /* * entries in swap_avail_heads, one * entry per node. diff --git a/mm/swapfile.c b/mm/swapfile.c index 4f0e8b2ac8aa..205a60c5f9cb 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si, #endif #define LATENCY_LIMIT 256 -static inline void cluster_set_flag(struct swap_cluster_info *info, - unsigned int flag) -{ - info->flags = flag; -} - -static inline unsigned int cluster_count(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_count(struct swap_cluster_info *info, - unsigned int c) -{ - info->data = c; -} - -static inline void cluster_set_count_flag(struct swap_cluster_info *info, - unsigned int c, unsigned int f) -{ - info->flags = f; - info->data = c; -} - -static inline unsigned int cluster_next(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_next(struct swap_cluster_info *info, - unsigned int n) -{ - info->data = n; -} - -static inline void cluster_set_next_flag(struct swap_cluster_info *info, - unsigned int n, unsigned int f) -{ - info->flags = f; - info->data = n; -} - static inline bool cluster_is_free(struct swap_cluster_info *info) { return info->flags & CLUSTER_FLAG_FREE; } -static inline bool cluster_is_null(struct swap_cluster_info *info) -{ - return info->flags & CLUSTER_FLAG_NEXT_NULL; -} - -static inline void cluster_set_null(struct swap_cluster_info *info) -{ - info->flags = CLUSTER_FLAG_NEXT_NULL; - info->data = 0; -} - static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, unsigned long offset) { @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, spin_unlock(&si->lock); } -static inline bool cluster_list_empty(struct swap_cluster_list *list) -{ - return cluster_is_null(&list->head); -} - -static inline unsigned int cluster_list_first(struct swap_cluster_list *list) -{ - return cluster_next(&list->head); -} - -static void cluster_list_init(struct swap_cluster_list *list) -{ - cluster_set_null(&list->head); - cluster_set_null(&list->tail); -} - -static void cluster_list_add_tail(struct swap_cluster_list *list, - struct swap_cluster_info *ci, - unsigned int idx) -{ - if (cluster_list_empty(list)) { - cluster_set_next_flag(&list->head, idx, 0); - cluster_set_next_flag(&list->tail, idx, 0); - } else { - struct swap_cluster_info *ci_tail; - unsigned int tail = cluster_next(&list->tail); - - /* - * Nested cluster lock, but both cluster locks are - * only acquired when we held swap_info_struct->lock - */ - ci_tail = ci + tail; - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); - cluster_set_next(ci_tail, idx); - spin_unlock(&ci_tail->lock); - cluster_set_next_flag(&list->tail, idx, 0); - } -} - -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, - struct swap_cluster_info *ci) -{ - unsigned int idx; - - idx = cluster_next(&list->head); - if (cluster_next(&list->tail) == idx) { - cluster_set_null(&list->head); - cluster_set_null(&list->tail); - } else - cluster_set_next_flag(&list->head, - cluster_next(&ci[idx]), 0); - - return idx; -} - /* Add a cluster to discard list and schedule it to do discard */ static void swap_cluster_schedule_discard(struct swap_info_struct *si, - unsigned int idx) 
+ struct swap_cluster_info *ci) { + unsigned int idx = ci - si->cluster_info; /* * If scan_swap_map_slots() can't find a free cluster, it will check * si->swap_map directly. To make sure the discarding cluster isn't @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); - + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING); + list_add_tail(&ci->next, &si->discard_clusters); + spin_unlock(&ci->lock); schedule_work(&si->discard_work); } -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_cluster_info *ci = si->cluster_info; - - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); - cluster_list_add_tail(&si->free_clusters, ci, idx); + ci->flags = CLUSTER_FLAG_FREE; + list_add_tail(&ci->next, &si->free_clusters); } /* @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) */ static void swap_do_scheduled_discard(struct swap_info_struct *si) { - struct swap_cluster_info *info, *ci; + struct swap_cluster_info *ci; unsigned int idx; - info = si->cluster_info; - - while (!cluster_list_empty(&si->discard_clusters)) { - idx = cluster_list_del_first(&si->discard_clusters, info); + while (!list_empty(&si->discard_clusters)) { + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next); + idx = ci - si->cluster_info; spin_unlock(&si->lock); discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); spin_lock(&si->lock); - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); - __free_cluster(si, idx); + + spin_lock(&ci->lock); + __free_cluster(si, ci); memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); unlock_cluster(ci); @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref) complete(&si->comp); } -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) { - struct swap_cluster_info *ci = si->cluster_info; + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); - cluster_list_del_first(&si->free_clusters, ci); - cluster_set_count_flag(ci + idx, 0, 0); + VM_BUG_ON(ci - si->cluster_info != idx); + list_del(&ci->next); + ci->count = 0; + ci->flags = 0; + return ci; } -static void free_cluster(struct swap_info_struct *si, unsigned long idx) +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_cluster_info *ci = si->cluster_info + idx; - - VM_BUG_ON(cluster_count(ci) != 0); + VM_BUG_ON(ci->count != 0); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. 
The cluster will be freed @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) */ if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == (SWP_WRITEOK | SWP_PAGE_DISCARD)) { - swap_cluster_schedule_discard(si, idx); + swap_cluster_schedule_discard(si, ci); return; } - __free_cluster(si, idx); + __free_cluster(si, ci); } /* @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, unsigned long count) { unsigned long idx = page_nr / SWAPFILE_CLUSTER; + struct swap_cluster_info *ci = cluster_info + idx; if (!cluster_info) return; - if (cluster_is_free(&cluster_info[idx])) + if (cluster_is_free(ci)) alloc_cluster(p, idx); - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) + count); + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); + ci->count += count; } /* @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, } /* - * The cluster corresponding to page_nr decreases one usage. If the usage - * counter becomes 0, which means no page in the cluster is in using, we can - * optionally discard the cluster and add it to free cluster list. + * The cluster ci decreases one usage. If the usage counter becomes 0, + * which means no page in the cluster is in using, we can optionally discard + * the cluster and add it to free cluster list. */ -static void dec_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) { - unsigned long idx = page_nr / SWAPFILE_CLUSTER; - - if (!cluster_info) + if (!p->cluster_info) return; - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) - 1); + VM_BUG_ON(ci->count == 0); + ci->count--; - if (cluster_count(&cluster_info[idx]) == 0) - free_cluster(p, idx); + if (!ci->count) + free_cluster(p, ci); } /* @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, { struct percpu_cluster *percpu_cluster; bool conflict; - + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); offset /= SWAPFILE_CLUSTER; - conflict = !cluster_list_empty(&si->free_clusters) && - offset != cluster_list_first(&si->free_clusters) && + conflict = !list_empty(&si->free_clusters) && + offset != first - si->cluster_info && cluster_is_free(&si->cluster_info[offset]); if (!conflict) @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, cluster = this_cpu_ptr(si->percpu_cluster); tmp = cluster->next[order]; if (tmp == SWAP_NEXT_INVALID) { - if (!cluster_list_empty(&si->free_clusters)) { - tmp = cluster_next(&si->free_clusters.head) * - SWAPFILE_CLUSTER; - } else if (!cluster_list_empty(&si->discard_clusters)) { + if (!list_empty(&si->free_clusters)) { + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); + list_del(&ci->next); + spin_lock(&ci->lock); + ci->flags = 0; + spin_unlock(&ci->lock); + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in * discarding, do discard now and reclaim them, then @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, goto new_cluster; } else return false; - } + } else + ci = 
si->cluster_info + tmp; /* * Other CPUs can use our cluster if they can't find a free cluster, @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); - cluster_set_count_flag(ci, 0, 0); - free_cluster(si, idx); + ci->count = 0; + ci->flags = 0; + free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); } @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) count = p->swap_map[offset]; VM_BUG_ON(count != SWAP_HAS_CACHE); p->swap_map[offset] = 0; - dec_cluster_info_page(p, p->cluster_info, offset); + dec_cluster_info_page(p, ci); unlock_cluster(ci); mem_cgroup_uncharge_swap(entry, 1); @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, nr_good_pages = maxpages - 1; /* omit header page */ - cluster_list_init(&p->free_clusters); - cluster_list_init(&p->discard_clusters); + INIT_LIST_HEAD(&p->free_clusters); + INIT_LIST_HEAD(&p->discard_clusters); for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, for (k = 0; k < SWAP_CLUSTER_COLS; k++) { j = (k + col) % SWAP_CLUSTER_COLS; for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { + struct swap_cluster_info *ci; idx = i * SWAP_CLUSTER_COLS + j; + ci = cluster_info + idx; if (idx >= nr_clusters) continue; - if (cluster_count(&cluster_info[idx])) + if (ci->count) continue; - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); - cluster_list_add_tail(&p->free_clusters, cluster_info, - idx); + ci->flags = CLUSTER_FLAG_FREE; + list_add_tail(&ci->next, &p->free_clusters); } } return nr_extents; -- 2.45.1.288.g0e0cd299f1-goog ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-24 17:17 ` [PATCH 1/2] mm: swap: swap cluster switch to double link list Chris Li @ 2024-05-28 16:23 ` Kairui Song 2024-05-28 22:27 ` Chris Li 2024-05-29 8:46 ` Huang, Ying 1 sibling, 1 reply; 43+ messages in thread From: Kairui Song @ 2024-05-28 16:23 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm, Barry Song On Sat, May 25, 2024 at 1:17 AM Chris Li <chrisl@kernel.org> wrote: > > Previously, the swap cluster used a cluster index as a pointer > to construct a custom single link list type "swap_cluster_list". > The next cluster pointer is shared with the cluster->count. > The assumption is that only the free cluster needs to be put > on the list. > > That assumption is not true for mTHP allocators any more. Need > to track the non full cluster on the list as well. Move the > current cluster single link list into standard double link list. > > Remove the cluster getter/setter for accessing the cluster > struct member. Move the cluster locking in the caller function > rather than the getter/setter function. That way the locking can > protect more than one member, e.g. cluster->flag. > > Change cluster code to use "struct swap_cluster_info *" to > reference the cluster rather than by using index. That is more > consistent with the list manipulation. It avoids the repeat > adding index to the cluser_info. The code is easier to understand. > > Remove the cluster next pointer is NULL flag, the double link > list can handle the empty list pretty well. > > The "swap_cluster_info" struct is two pointer bigger, because > 512 swap entries share one swap struct, it has very little impact > on the average memory usage per swap entry. Other than the list > conversion, there is no real function change in this patch. > --- > include/linux/swap.h | 14 ++-- > mm/swapfile.c | 231 ++++++++++++++------------------------------------- > 2 files changed, 68 insertions(+), 177 deletions(-) > Hi Chris, Thanks for this very nice clean up, the code is much easier to read. > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 11c53692f65f..0d3906eff3c9 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -254,11 +254,12 @@ struct swap_cluster_info { > * elements correspond to the swap > * cluster > */ > - unsigned int data:24; > + unsigned int count:16; > unsigned int flags:8; > + struct list_head next; > }; > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ > + > > /* > * The first page in the swap file is the swap header, which is always marked > @@ -283,11 +284,6 @@ struct percpu_cluster { > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ > }; > > -struct swap_cluster_list { > - struct swap_cluster_info head; > - struct swap_cluster_info tail; > -}; > - > /* > * The in-memory structure used to track swap areas. > */ > @@ -300,7 +296,7 @@ struct swap_info_struct { > unsigned int max; /* extent of the swap_map */ > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > struct swap_cluster_info *cluster_info; /* cluster info. 
Only for SSD */ > - struct swap_cluster_list free_clusters; /* free clusters list */ > + struct list_head free_clusters; /* free clusters list */ > unsigned int lowest_bit; /* index of first free in swap_map */ > unsigned int highest_bit; /* index of last free in swap_map */ > unsigned int pages; /* total of usable pages of swap */ > @@ -333,7 +329,7 @@ struct swap_info_struct { > * list. > */ > struct work_struct discard_work; /* discard worker */ > - struct swap_cluster_list discard_clusters; /* discard clusters list */ > + struct list_head discard_clusters; /* discard clusters list */ > struct plist_node avail_lists[]; /* > * entries in swap_avail_heads, one > * entry per node. > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 4f0e8b2ac8aa..205a60c5f9cb 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si, > #endif > #define LATENCY_LIMIT 256 > > -static inline void cluster_set_flag(struct swap_cluster_info *info, > - unsigned int flag) > -{ > - info->flags = flag; > -} > - > -static inline unsigned int cluster_count(struct swap_cluster_info *info) > -{ > - return info->data; > -} > - > -static inline void cluster_set_count(struct swap_cluster_info *info, > - unsigned int c) > -{ > - info->data = c; > -} > - > -static inline void cluster_set_count_flag(struct swap_cluster_info *info, > - unsigned int c, unsigned int f) > -{ > - info->flags = f; > - info->data = c; > -} > - > -static inline unsigned int cluster_next(struct swap_cluster_info *info) > -{ > - return info->data; > -} > - > -static inline void cluster_set_next(struct swap_cluster_info *info, > - unsigned int n) > -{ > - info->data = n; > -} > - > -static inline void cluster_set_next_flag(struct swap_cluster_info *info, > - unsigned int n, unsigned int f) > -{ > - info->flags = f; > - info->data = n; > -} > - > static inline bool cluster_is_free(struct swap_cluster_info *info) > { > return info->flags & CLUSTER_FLAG_FREE; > } > > -static inline bool cluster_is_null(struct swap_cluster_info *info) > -{ > - return info->flags & CLUSTER_FLAG_NEXT_NULL; > -} > - > -static inline void cluster_set_null(struct swap_cluster_info *info) > -{ > - info->flags = CLUSTER_FLAG_NEXT_NULL; > - info->data = 0; > -} > - > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, > unsigned long offset) > { > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, > spin_unlock(&si->lock); > } > > -static inline bool cluster_list_empty(struct swap_cluster_list *list) > -{ > - return cluster_is_null(&list->head); > -} > - > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list) > -{ > - return cluster_next(&list->head); > -} > - > -static void cluster_list_init(struct swap_cluster_list *list) > -{ > - cluster_set_null(&list->head); > - cluster_set_null(&list->tail); > -} > - > -static void cluster_list_add_tail(struct swap_cluster_list *list, > - struct swap_cluster_info *ci, > - unsigned int idx) > -{ > - if (cluster_list_empty(list)) { > - cluster_set_next_flag(&list->head, idx, 0); > - cluster_set_next_flag(&list->tail, idx, 0); > - } else { > - struct swap_cluster_info *ci_tail; > - unsigned int tail = cluster_next(&list->tail); > - > - /* > - * Nested cluster lock, but both cluster locks are > - * only acquired when we held swap_info_struct->lock > - */ > - ci_tail = ci + tail; > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); > - cluster_set_next(ci_tail, 
idx); > - spin_unlock(&ci_tail->lock); > - cluster_set_next_flag(&list->tail, idx, 0); > - } > -} > - > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, > - struct swap_cluster_info *ci) > -{ > - unsigned int idx; > - > - idx = cluster_next(&list->head); > - if (cluster_next(&list->tail) == idx) { > - cluster_set_null(&list->head); > - cluster_set_null(&list->tail); > - } else > - cluster_set_next_flag(&list->head, > - cluster_next(&ci[idx]), 0); > - > - return idx; > -} > - > /* Add a cluster to discard list and schedule it to do discard */ > static void swap_cluster_schedule_discard(struct swap_info_struct *si, > - unsigned int idx) > + struct swap_cluster_info *ci) > { > + unsigned int idx = ci - si->cluster_info; > /* > * If scan_swap_map_slots() can't find a free cluster, it will check > * si->swap_map directly. To make sure the discarding cluster isn't > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > SWAP_MAP_BAD, SWAPFILE_CLUSTER); > > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); > - > + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING); > + list_add_tail(&ci->next, &si->discard_clusters); > + spin_unlock(&ci->lock); > schedule_work(&si->discard_work); > } > > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > { > - struct swap_cluster_info *ci = si->cluster_info; > - > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); > - cluster_list_add_tail(&si->free_clusters, ci, idx); > + ci->flags = CLUSTER_FLAG_FREE; > + list_add_tail(&ci->next, &si->free_clusters); > } > > /* > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > */ > static void swap_do_scheduled_discard(struct swap_info_struct *si) > { > - struct swap_cluster_info *info, *ci; > + struct swap_cluster_info *ci; > unsigned int idx; > > - info = si->cluster_info; > - > - while (!cluster_list_empty(&si->discard_clusters)) { > - idx = cluster_list_del_first(&si->discard_clusters, info); > + while (!list_empty(&si->discard_clusters)) { > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next); > + idx = ci - si->cluster_info; > spin_unlock(&si->lock); > > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, > SWAPFILE_CLUSTER); > > spin_lock(&si->lock); > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); > - __free_cluster(si, idx); > + > + spin_lock(&ci->lock); > + __free_cluster(si, ci); > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > 0, SWAPFILE_CLUSTER); > unlock_cluster(ci); > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref) > complete(&si->comp); > } > > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) > { > - struct swap_cluster_info *ci = si->cluster_info; > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); > - cluster_list_del_first(&si->free_clusters, ci); > - cluster_set_count_flag(ci + idx, 0, 0); > + VM_BUG_ON(ci - si->cluster_info != idx); > + list_del(&ci->next); > + ci->count = 0; > + ci->flags = 0; > + return ci; > } > > -static void free_cluster(struct swap_info_struct *si, unsigned long idx) > +static void 
free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > { > - struct swap_cluster_info *ci = si->cluster_info + idx; > - > - VM_BUG_ON(cluster_count(ci) != 0); > + VM_BUG_ON(ci->count != 0); > /* > * If the swap is discardable, prepare discard the cluster > * instead of free it immediately. The cluster will be freed > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) > */ > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == > (SWP_WRITEOK | SWP_PAGE_DISCARD)) { > - swap_cluster_schedule_discard(si, idx); > + swap_cluster_schedule_discard(si, ci); > return; > } > > - __free_cluster(si, idx); > + __free_cluster(si, ci); > } > > /* > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, > unsigned long count) > { > unsigned long idx = page_nr / SWAPFILE_CLUSTER; > + struct swap_cluster_info *ci = cluster_info + idx; > > if (!cluster_info) > return; > - if (cluster_is_free(&cluster_info[idx])) > + if (cluster_is_free(ci)) > alloc_cluster(p, idx); > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); > - cluster_set_count(&cluster_info[idx], > - cluster_count(&cluster_info[idx]) + count); > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); > + ci->count += count; > } > > /* > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, > } > > /* > - * The cluster corresponding to page_nr decreases one usage. If the usage > - * counter becomes 0, which means no page in the cluster is in using, we can > - * optionally discard the cluster and add it to free cluster list. > + * The cluster ci decreases one usage. If the usage counter becomes 0, > + * which means no page in the cluster is in using, we can optionally discard > + * the cluster and add it to free cluster list. > */ > -static void dec_cluster_info_page(struct swap_info_struct *p, > - struct swap_cluster_info *cluster_info, unsigned long page_nr) > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) > { > - unsigned long idx = page_nr / SWAPFILE_CLUSTER; > - > - if (!cluster_info) > + if (!p->cluster_info) > return; > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); > - cluster_set_count(&cluster_info[idx], > - cluster_count(&cluster_info[idx]) - 1); > + VM_BUG_ON(ci->count == 0); > + ci->count--; > > - if (cluster_count(&cluster_info[idx]) == 0) > - free_cluster(p, idx); > + if (!ci->count) > + free_cluster(p, ci); > } > > /* > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > { This whole scan_swap_map_ssd_cluster_conflict function seems not needed now. free_clusters is a double linked list, so using a cluster in the middle won't corrupt the list. The comments are still for the old list design. 
> struct percpu_cluster *percpu_cluster; > bool conflict; > - > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > offset /= SWAPFILE_CLUSTER; > - conflict = !cluster_list_empty(&si->free_clusters) && > - offset != cluster_list_first(&si->free_clusters) && > + conflict = !list_empty(&si->free_clusters) && > + offset != first - si->cluster_info && > cluster_is_free(&si->cluster_info[offset]); > > if (!conflict) > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > cluster = this_cpu_ptr(si->percpu_cluster); > tmp = cluster->next[order]; > if (tmp == SWAP_NEXT_INVALID) { > - if (!cluster_list_empty(&si->free_clusters)) { > - tmp = cluster_next(&si->free_clusters.head) * > - SWAPFILE_CLUSTER; > - } else if (!cluster_list_empty(&si->discard_clusters)) { > + if (!list_empty(&si->free_clusters)) { > + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > + list_del(&ci->next); > + spin_lock(&ci->lock); Shouldn't this list_del also be protected by ci->lock? It was protected in alloc_cluster before, keeping the flag synced with cluster status so cluster_is_free won't return false positive. > + ci->flags = 0; > + spin_unlock(&ci->lock); > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > + } else if (!list_empty(&si->discard_clusters)) { > /* > * we don't have free cluster but have some clusters in > * discarding, do discard now and reclaim them, then > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > goto new_cluster; > } else > return false; > - } > + } else > + ci = si->cluster_info + tmp; This "else ci = ..." seems wrong, tmp is not an array index, and not needed either. > > /* > * Other CPUs can use our cluster if they can't find a free cluster, > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > > ci = lock_cluster(si, offset); > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > - cluster_set_count_flag(ci, 0, 0); > - free_cluster(si, idx); > + ci->count = 0; > + ci->flags = 0; > + free_cluster(si, ci); > unlock_cluster(ci); > swap_range_free(si, offset, SWAPFILE_CLUSTER); > } > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > count = p->swap_map[offset]; > VM_BUG_ON(count != SWAP_HAS_CACHE); > p->swap_map[offset] = 0; > - dec_cluster_info_page(p, p->cluster_info, offset); > + dec_cluster_info_page(p, ci); > unlock_cluster(ci); > > mem_cgroup_uncharge_swap(entry, 1); > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > nr_good_pages = maxpages - 1; /* omit header page */ > > - cluster_list_init(&p->free_clusters); > - cluster_list_init(&p->discard_clusters); > + INIT_LIST_HEAD(&p->free_clusters); > + INIT_LIST_HEAD(&p->discard_clusters); > > for (i = 0; i < swap_header->info.nr_badpages; i++) { > unsigned int page_nr = swap_header->info.badpages[i]; > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > for (k = 0; k < SWAP_CLUSTER_COLS; k++) { > j = (k + col) % SWAP_CLUSTER_COLS; > for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { > + struct swap_cluster_info *ci; > idx = i * SWAP_CLUSTER_COLS + j; > + ci = cluster_info + idx; > if (idx >= nr_clusters) > continue; > - if (cluster_count(&cluster_info[idx])) > + if (ci->count) > continue; > - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); > - 
cluster_list_add_tail(&p->free_clusters, cluster_info, > - idx); > + ci->flags = CLUSTER_FLAG_FREE; > + list_add_tail(&ci->next, &p->free_clusters); > } > } > return nr_extents; > > -- > 2.45.1.288.g0e0cd299f1-goog > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-28 16:23 ` Kairui Song @ 2024-05-28 22:27 ` Chris Li 2024-05-29 0:50 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-05-28 22:27 UTC (permalink / raw) To: Kairui Song Cc: Andrew Morton, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm, Barry Song Hi Kairui, On Tue, May 28, 2024 at 9:24 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Sat, May 25, 2024 at 1:17 AM Chris Li <chrisl@kernel.org> wrote: > > > > Previously, the swap cluster used a cluster index as a pointer > > to construct a custom single link list type "swap_cluster_list". > > The next cluster pointer is shared with the cluster->count. > > The assumption is that only the free cluster needs to be put > > on the list. > > > > That assumption is not true for mTHP allocators any more. Need > > to track the non full cluster on the list as well. Move the > > current cluster single link list into standard double link list. > > > > Remove the cluster getter/setter for accessing the cluster > > struct member. Move the cluster locking in the caller function > > rather than the getter/setter function. That way the locking can > > protect more than one member, e.g. cluster->flag. > > > > Change cluster code to use "struct swap_cluster_info *" to > > reference the cluster rather than by using index. That is more > > consistent with the list manipulation. It avoids the repeat > > adding index to the cluser_info. The code is easier to understand. > > > > Remove the cluster next pointer is NULL flag, the double link > > list can handle the empty list pretty well. > > > > The "swap_cluster_info" struct is two pointer bigger, because > > 512 swap entries share one swap struct, it has very little impact > > on the average memory usage per swap entry. Other than the list > > conversion, there is no real function change in this patch. > > --- > > include/linux/swap.h | 14 ++-- > > mm/swapfile.c | 231 ++++++++++++++------------------------------------- > > 2 files changed, 68 insertions(+), 177 deletions(-) > > > > Hi Chris, > > Thanks for this very nice clean up, the code is much easier to read. Thanks for the review. See my comments below. I am working on a V2 to address the two issues identified so far. BTW, I am pretty happy the patch stats have much more deltes than insert. > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 11c53692f65f..0d3906eff3c9 100644 > > --- a/include/linux/swap.hm > > +++ b/include/linux/swap.h > > @@ -254,11 +254,12 @@ struct swap_cluster_info { > > * elements correspond to the swap > > * cluster > > */ > > - unsigned int data:24; > > + unsigned int count:16; > > unsigned int flags:8; > > + struct list_head next; > > }; > > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ > > + > > > > /* > > * The first page in the swap file is the swap header, which is always marked > > @@ -283,11 +284,6 @@ struct percpu_cluster { > > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ > > }; > > > > -struct swap_cluster_list { > > - struct swap_cluster_info head; > > - struct swap_cluster_info tail; > > -}; > > - > > /* > > * The in-memory structure used to track swap areas. > > */ > > @@ -300,7 +296,7 @@ struct swap_info_struct { > > unsigned int max; /* extent of the swap_map */ > > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > > struct swap_cluster_info *cluster_info; /* cluster info. 
Only for SSD */ > > - struct swap_cluster_list free_clusters; /* free clusters list */ > > + struct list_head free_clusters; /* free clusters list */ > > unsigned int lowest_bit; /* index of first free in swap_map */ > > unsigned int highest_bit; /* index of last free in swap_map */ > > unsigned int pages; /* total of usable pages of swap */ > > @@ -333,7 +329,7 @@ struct swap_info_struct { > > * list. > > */ > > struct work_struct discard_work; /* discard worker */ > > - struct swap_cluster_list discard_clusters; /* discard clusters list */ > > + struct list_head discard_clusters; /* discard clusters list */ > > struct plist_node avail_lists[]; /* > > * entries in swap_avail_heads, one > > * entry per node. > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 4f0e8b2ac8aa..205a60c5f9cb 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si, > > #endif > > #define LATENCY_LIMIT 256 > > > > -static inline void cluster_set_flag(struct swap_cluster_info *info, > > - unsigned int flag) > > -{ > > - info->flags = flag; > > -} > > - > > -static inline unsigned int cluster_count(struct swap_cluster_info *info) > > -{ > > - return info->data; > > -} > > - > > -static inline void cluster_set_count(struct swap_cluster_info *info, > > - unsigned int c) > > -{ > > - info->data = c; > > -} > > - > > -static inline void cluster_set_count_flag(struct swap_cluster_info *info, > > - unsigned int c, unsigned int f) > > -{ > > - info->flags = f; > > - info->data = c; > > -} > > - > > -static inline unsigned int cluster_next(struct swap_cluster_info *info) > > -{ > > - return info->data; > > -} > > - > > -static inline void cluster_set_next(struct swap_cluster_info *info, > > - unsigned int n) > > -{ > > - info->data = n; > > -} > > - > > -static inline void cluster_set_next_flag(struct swap_cluster_info *info, > > - unsigned int n, unsigned int f) > > -{ > > - info->flags = f; > > - info->data = n; > > -} > > - > > static inline bool cluster_is_free(struct swap_cluster_info *info) > > { > > return info->flags & CLUSTER_FLAG_FREE; > > } > > > > -static inline bool cluster_is_null(struct swap_cluster_info *info) > > -{ > > - return info->flags & CLUSTER_FLAG_NEXT_NULL; > > -} > > - > > -static inline void cluster_set_null(struct swap_cluster_info *info) > > -{ > > - info->flags = CLUSTER_FLAG_NEXT_NULL; > > - info->data = 0; > > -} > > - > > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, > > unsigned long offset) > > { > > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, > > spin_unlock(&si->lock); > > } > > > > -static inline bool cluster_list_empty(struct swap_cluster_list *list) > > -{ > > - return cluster_is_null(&list->head); > > -} > > - > > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list) > > -{ > > - return cluster_next(&list->head); > > -} > > - > > -static void cluster_list_init(struct swap_cluster_list *list) > > -{ > > - cluster_set_null(&list->head); > > - cluster_set_null(&list->tail); > > -} > > - > > -static void cluster_list_add_tail(struct swap_cluster_list *list, > > - struct swap_cluster_info *ci, > > - unsigned int idx) > > -{ > > - if (cluster_list_empty(list)) { > > - cluster_set_next_flag(&list->head, idx, 0); > > - cluster_set_next_flag(&list->tail, idx, 0); > > - } else { > > - struct swap_cluster_info *ci_tail; > > - unsigned int tail = cluster_next(&list->tail); > > - > > - /* > > 
- * Nested cluster lock, but both cluster locks are > > - * only acquired when we held swap_info_struct->lock > > - */ > > - ci_tail = ci + tail; > > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); > > - cluster_set_next(ci_tail, idx); > > - spin_unlock(&ci_tail->lock); > > - cluster_set_next_flag(&list->tail, idx, 0); > > - } > > -} > > - > > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, > > - struct swap_cluster_info *ci) > > -{ > > - unsigned int idx; > > - > > - idx = cluster_next(&list->head); > > - if (cluster_next(&list->tail) == idx) { > > - cluster_set_null(&list->head); > > - cluster_set_null(&list->tail); > > - } else > > - cluster_set_next_flag(&list->head, > > - cluster_next(&ci[idx]), 0); > > - > > - return idx; > > -} > > - > > /* Add a cluster to discard list and schedule it to do discard */ > > static void swap_cluster_schedule_discard(struct swap_info_struct *si, > > - unsigned int idx) > > + struct swap_cluster_info *ci) > > { > > + unsigned int idx = ci - si->cluster_info; > > /* > > * If scan_swap_map_slots() can't find a free cluster, it will check > > * si->swap_map directly. To make sure the discarding cluster isn't > > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > > SWAP_MAP_BAD, SWAPFILE_CLUSTER); > > > > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); > > - > > + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING); > > + list_add_tail(&ci->next, &si->discard_clusters); > > + spin_unlock(&ci->lock); > > schedule_work(&si->discard_work); > > } > > > > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > > { > > - struct swap_cluster_info *ci = si->cluster_info; > > - > > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); > > - cluster_list_add_tail(&si->free_clusters, ci, idx); > > + ci->flags = CLUSTER_FLAG_FREE; > > + list_add_tail(&ci->next, &si->free_clusters); > > } > > > > /* > > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > > */ > > static void swap_do_scheduled_discard(struct swap_info_struct *si) > > { > > - struct swap_cluster_info *info, *ci; > > + struct swap_cluster_info *ci; > > unsigned int idx; > > > > - info = si->cluster_info; > > - > > - while (!cluster_list_empty(&si->discard_clusters)) { > > - idx = cluster_list_del_first(&si->discard_clusters, info); > > + while (!list_empty(&si->discard_clusters)) { > > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next); > > + idx = ci - si->cluster_info; > > spin_unlock(&si->lock); > > > > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, > > SWAPFILE_CLUSTER); > > > > spin_lock(&si->lock); > > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); > > - __free_cluster(si, idx); > > + > > + spin_lock(&ci->lock); > > + __free_cluster(si, ci); > > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > > 0, SWAPFILE_CLUSTER); > > unlock_cluster(ci); > > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref) > > complete(&si->comp); > > } > > > > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) > > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) > > { > > - struct swap_cluster_info *ci = si->cluster_info; > > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, 
struct swap_cluster_info, next); > > > > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); > > - cluster_list_del_first(&si->free_clusters, ci); > > - cluster_set_count_flag(ci + idx, 0, 0); > > + VM_BUG_ON(ci - si->cluster_info != idx); > > + list_del(&ci->next); > > + ci->count = 0; > > + ci->flags = 0; > > + return ci; > > } > > > > -static void free_cluster(struct swap_info_struct *si, unsigned long idx) > > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > > { > > - struct swap_cluster_info *ci = si->cluster_info + idx; > > - > > - VM_BUG_ON(cluster_count(ci) != 0); > > + VM_BUG_ON(ci->count != 0); > > /* > > * If the swap is discardable, prepare discard the cluster > > * instead of free it immediately. The cluster will be freed > > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) > > */ > > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == > > (SWP_WRITEOK | SWP_PAGE_DISCARD)) { > > - swap_cluster_schedule_discard(si, idx); > > + swap_cluster_schedule_discard(si, ci); > > return; > > } > > > > - __free_cluster(si, idx); > > + __free_cluster(si, ci); > > } > > > > /* > > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, > > unsigned long count) > > { > > unsigned long idx = page_nr / SWAPFILE_CLUSTER; > > + struct swap_cluster_info *ci = cluster_info + idx; > > > > if (!cluster_info) > > return; > > - if (cluster_is_free(&cluster_info[idx])) > > + if (cluster_is_free(ci)) > > alloc_cluster(p, idx); > > > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); > > - cluster_set_count(&cluster_info[idx], > > - cluster_count(&cluster_info[idx]) + count); > > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); > > + ci->count += count; > > } > > > > /* > > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, > > } > > > > /* > > - * The cluster corresponding to page_nr decreases one usage. If the usage > > - * counter becomes 0, which means no page in the cluster is in using, we can > > - * optionally discard the cluster and add it to free cluster list. > > + * The cluster ci decreases one usage. If the usage counter becomes 0, > > + * which means no page in the cluster is in using, we can optionally discard > > + * the cluster and add it to free cluster list. > > */ > > -static void dec_cluster_info_page(struct swap_info_struct *p, > > - struct swap_cluster_info *cluster_info, unsigned long page_nr) > > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) > > { > > - unsigned long idx = page_nr / SWAPFILE_CLUSTER; > > - > > - if (!cluster_info) > > + if (!p->cluster_info) > > return; > > > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); > > - cluster_set_count(&cluster_info[idx], > > - cluster_count(&cluster_info[idx]) - 1); > > + VM_BUG_ON(ci->count == 0); > > + ci->count--; > > > > - if (cluster_count(&cluster_info[idx]) == 0) > > - free_cluster(p, idx); > > + if (!ci->count) > > + free_cluster(p, ci); > > } > > > > /* > > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > > { > > This whole scan_swap_map_ssd_cluster_conflict function seems not > needed now. free_clusters is a double linked list, so using a cluster > in the middle won't corrupt the list. The comments are still for the > old list design. 
I was debating removing the cluster_conflict() as well and found out it can't be removed until we change the order 0 allocations also use clusters. There can still be conflict because the order 0 allocations just do the bruce force scan of swap_map[] when try_ssd fails. This causes other problems as well. As far as I can tell, the conflict can still happen. > > > struct percpu_cluster *percpu_cluster; > > bool conflict; > > - > > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > offset /= SWAPFILE_CLUSTER; > > - conflict = !cluster_list_empty(&si->free_clusters) && > > - offset != cluster_list_first(&si->free_clusters) && > > + conflict = !list_empty(&si->free_clusters) && > > + offset != first - si->cluster_info && > > cluster_is_free(&si->cluster_info[offset]); > > > > if (!conflict) > > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > cluster = this_cpu_ptr(si->percpu_cluster); > > tmp = cluster->next[order]; > > if (tmp == SWAP_NEXT_INVALID) { > > - if (!cluster_list_empty(&si->free_clusters)) { > > - tmp = cluster_next(&si->free_clusters.head) * > > - SWAPFILE_CLUSTER; > > - } else if (!cluster_list_empty(&si->discard_clusters)) { > > + if (!list_empty(&si->free_clusters)) { > > + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > + list_del(&ci->next); > > + spin_lock(&ci->lock); > > Shouldn't this list_del also be protected by ci->lock? It was > protected in alloc_cluster before, keeping the flag synced with > cluster status so cluster_is_free won't return false positive. The list add and list del are protected by Si->lock not by cluster lock. Previously I wanted to use cluster->lock to protect it and realized that adding/deleting the cluster to/from the list will change three clusters. (current, prev, next). We need to get three cluster locks. We might change to a per list spinlock. e.g. one lock for one list to reduce the contention on Si->lock. However, per cluster lock is not enough if we only take one cluster lock. > > > + ci->flags = 0; > > + spin_unlock(&ci->lock); > > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > > + } else if (!list_empty(&si->discard_clusters)) { > > /* > > * we don't have free cluster but have some clusters in > > * discarding, do discard now and reclaim them, then > > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > goto new_cluster; > > } else > > return false; > > - } > > + } else > > + ci = si->cluster_info + tmp; > > This "else ci = ..." seems wrong, tmp is not an array index, and not > needed either. Yes, there is a bug there, pointed out by OPPO as well. It should be ci = si->cluster_info + (tmp/ SWAPFILE_CLUSTER); "tmp" is needed because "tmp" or " cluster->next[order]" keep track of the current cluster allocation offset, in the per cpu cluster struct. BTW, In my V2 I have changed "tmp" to "offset" and previous "offset" to "retoffset" to make it more obvious. "tmp" does not give much information about what it really does. 
Chris > > > > > /* > > * Other CPUs can use our cluster if they can't find a free cluster, > > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > > > > ci = lock_cluster(si, offset); > > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > > - cluster_set_count_flag(ci, 0, 0); > > - free_cluster(si, idx); > > + ci->count = 0; > > + ci->flags = 0; > > + free_cluster(si, ci); > > unlock_cluster(ci); > > swap_range_free(si, offset, SWAPFILE_CLUSTER); > > } > > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > > count = p->swap_map[offset]; > > VM_BUG_ON(count != SWAP_HAS_CACHE); > > p->swap_map[offset] = 0; > > - dec_cluster_info_page(p, p->cluster_info, offset); > > + dec_cluster_info_page(p, ci); > > unlock_cluster(ci); > > > > mem_cgroup_uncharge_swap(entry, 1); > > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > > > nr_good_pages = maxpages - 1; /* omit header page */ > > > > - cluster_list_init(&p->free_clusters); > > - cluster_list_init(&p->discard_clusters); > > + INIT_LIST_HEAD(&p->free_clusters); > > + INIT_LIST_HEAD(&p->discard_clusters); > > > > for (i = 0; i < swap_header->info.nr_badpages; i++) { > > unsigned int page_nr = swap_header->info.badpages[i]; > > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > for (k = 0; k < SWAP_CLUSTER_COLS; k++) { > > j = (k + col) % SWAP_CLUSTER_COLS; > > for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { > > + struct swap_cluster_info *ci; > > idx = i * SWAP_CLUSTER_COLS + j; > > + ci = cluster_info + idx; > > if (idx >= nr_clusters) > > continue; > > - if (cluster_count(&cluster_info[idx])) > > + if (ci->count) > > continue; > > - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); > > - cluster_list_add_tail(&p->free_clusters, cluster_info, > > - idx); > > + ci->flags = CLUSTER_FLAG_FREE; > > + list_add_tail(&ci->next, &p->free_clusters); > > } > > } > > return nr_extents; > > > > -- > > 2.45.1.288.g0e0cd299f1-goog > > > > > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-28 22:27 ` Chris Li @ 2024-05-29 0:50 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-05-29 0:50 UTC (permalink / raw) To: Kairui Song Cc: Andrew Morton, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm, Barry Song On Tue, May 28, 2024 at 3:27 PM Chris Li <chriscli@google.com> wrote: > > > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > > goto new_cluster; > > > } else > > > return false; > > > - } > > > + } else > > > + ci = si->cluster_info + tmp; > > > > This "else ci = ..." seems wrong, tmp is not an array index, and not > > needed either. > > Yes, there is a bug there, pointed out by OPPO as well. It should be > ci = si->cluster_info + (tmp/ SWAPFILE_CLUSTER); > > "tmp" is needed because "tmp" or " cluster->next[order]" keep track of > the current cluster allocation offset, > in the per cpu cluster struct. Hi Kairui, Actually, you are right, the "ci" is not used here. That is why that ci out of bound error does not trigger kernel OOPS. We can delete that else line completely. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
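To summarize the fix discussed above (a sketch using the identifiers from
this patch): "tmp" taken from cluster->next[order] is a swap offset, not a
cluster index, so the conversions are:

	/* swap offset -> cluster pointer */
	struct swap_cluster_info *ci = si->cluster_info + tmp / SWAPFILE_CLUSTER;
	/* cluster pointer -> cluster index */
	unsigned long idx = ci - si->cluster_info;
	/* cluster index -> first swap offset of that cluster */
	unsigned long offset = idx * SWAPFILE_CLUSTER;

and, as noted, the "else ci = ..." branch can simply be dropped, since ci
is not used on that path.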
* Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-24 17:17 ` [PATCH 1/2] mm: swap: swap cluster switch to double link list Chris Li 2024-05-28 16:23 ` Kairui Song @ 2024-05-29 8:46 ` Huang, Ying 2024-05-30 21:49 ` Chris Li 1 sibling, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-05-29 8:46 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > Previously, the swap cluster used a cluster index as a pointer > to construct a custom single link list type "swap_cluster_list". > The next cluster pointer is shared with the cluster->count. > The assumption is that only the free cluster needs to be put > on the list. > > That assumption is not true for mTHP allocators any more. I think that the words aren't correct here. mTHP swap entry allocators can work with current cluster definition. > Need to track the non full cluster on the list as well. Move the > current cluster single link list into standard double link list. It's an optimization to track non-full cluster with a list. I understand that you want to change cluster list definition. I just feel the wording isn't accurate. > Remove the cluster getter/setter for accessing the cluster > struct member. Move the cluster locking in the caller function > rather than the getter/setter function. That way the locking can > protect more than one member, e.g. cluster->flag. Sorry, I don't understand the locking in above words. I don't find that we lock/unlock in the original getter/setter functions. I found that the cluster locking rule for cluster list is changed. Better to make this explicit. > Change cluster code to use "struct swap_cluster_info *" to > reference the cluster rather than by using index. That is more > consistent with the list manipulation. It avoids the repeat > adding index to the cluser_info. The code is easier to understand. > > Remove the cluster next pointer is NULL flag, the double link > list can handle the empty list pretty well. > > The "swap_cluster_info" struct is two pointer bigger, because > 512 swap entries share one swap struct, it has very little impact > on the average memory usage per swap entry. Other than the list > conversion, there is no real function change in this patch. On 64bit platform, the size of swap_cluster_info increases from 8 bytes to 24 bytes. For a 1TB swap device, the memory usage will increase from 4MB to 12MB. This looks OK for me. Another choice is to use a customized double linked list using "unsigned int" as pointer to cluster. That will reduce the size of cluster to 16 bytes. But it may be not necessary to do that. Anyway, I think that it's better to add more calculation in changelog for memory usage increment. > --- > include/linux/swap.h | 14 ++-- > mm/swapfile.c | 231 ++++++++++++++------------------------------------- > 2 files changed, 68 insertions(+), 177 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 11c53692f65f..0d3906eff3c9 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -254,11 +254,12 @@ struct swap_cluster_info { > * elements correspond to the swap > * cluster > */ > - unsigned int data:24; > + unsigned int count:16; > unsigned int flags:8; If we use 16bits and 8 bits in bit fields, why not just use u8 and u16 instead? > + struct list_head next; "next" isn't a good naming because prev pointer is in list_head too. The common naming is "list". 
Need to revise comments for swap_cluster_info.lock and add the locking rule comments for swap_cluster_info.next. > }; > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ > + > > /* > * The first page in the swap file is the swap header, which is always marked > @@ -283,11 +284,6 @@ struct percpu_cluster { > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ > }; > > -struct swap_cluster_list { > - struct swap_cluster_info head; > - struct swap_cluster_info tail; > -}; > - > /* > * The in-memory structure used to track swap areas. > */ > @@ -300,7 +296,7 @@ struct swap_info_struct { > unsigned int max; /* extent of the swap_map */ > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > - struct swap_cluster_list free_clusters; /* free clusters list */ > + struct list_head free_clusters; /* free clusters list */ > unsigned int lowest_bit; /* index of first free in swap_map */ > unsigned int highest_bit; /* index of last free in swap_map */ > unsigned int pages; /* total of usable pages of swap */ > @@ -333,7 +329,7 @@ struct swap_info_struct { > * list. > */ > struct work_struct discard_work; /* discard worker */ > - struct swap_cluster_list discard_clusters; /* discard clusters list */ > + struct list_head discard_clusters; /* discard clusters list */ > struct plist_node avail_lists[]; /* > * entries in swap_avail_heads, one > * entry per node. > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 4f0e8b2ac8aa..205a60c5f9cb 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si, > #endif > #define LATENCY_LIMIT 256 > > -static inline void cluster_set_flag(struct swap_cluster_info *info, > - unsigned int flag) > -{ > - info->flags = flag; > -} > - > -static inline unsigned int cluster_count(struct swap_cluster_info *info) > -{ > - return info->data; > -} > - > -static inline void cluster_set_count(struct swap_cluster_info *info, > - unsigned int c) > -{ > - info->data = c; > -} > - > -static inline void cluster_set_count_flag(struct swap_cluster_info *info, > - unsigned int c, unsigned int f) > -{ > - info->flags = f; > - info->data = c; > -} > - > -static inline unsigned int cluster_next(struct swap_cluster_info *info) > -{ > - return info->data; > -} > - > -static inline void cluster_set_next(struct swap_cluster_info *info, > - unsigned int n) > -{ > - info->data = n; > -} > - > -static inline void cluster_set_next_flag(struct swap_cluster_info *info, > - unsigned int n, unsigned int f) > -{ > - info->flags = f; > - info->data = n; > -} > - > static inline bool cluster_is_free(struct swap_cluster_info *info) > { > return info->flags & CLUSTER_FLAG_FREE; > } > > -static inline bool cluster_is_null(struct swap_cluster_info *info) > -{ > - return info->flags & CLUSTER_FLAG_NEXT_NULL; > -} > - > -static inline void cluster_set_null(struct swap_cluster_info *info) > -{ > - info->flags = CLUSTER_FLAG_NEXT_NULL; > - info->data = 0; > -} > - > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, > unsigned long offset) > { > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, > spin_unlock(&si->lock); > } > > -static inline bool cluster_list_empty(struct swap_cluster_list *list) > -{ > - return cluster_is_null(&list->head); > -} > - > -static inline unsigned int 
cluster_list_first(struct swap_cluster_list *list) > -{ > - return cluster_next(&list->head); > -} > - > -static void cluster_list_init(struct swap_cluster_list *list) > -{ > - cluster_set_null(&list->head); > - cluster_set_null(&list->tail); > -} > - > -static void cluster_list_add_tail(struct swap_cluster_list *list, > - struct swap_cluster_info *ci, > - unsigned int idx) > -{ > - if (cluster_list_empty(list)) { > - cluster_set_next_flag(&list->head, idx, 0); > - cluster_set_next_flag(&list->tail, idx, 0); > - } else { > - struct swap_cluster_info *ci_tail; > - unsigned int tail = cluster_next(&list->tail); > - > - /* > - * Nested cluster lock, but both cluster locks are > - * only acquired when we held swap_info_struct->lock > - */ > - ci_tail = ci + tail; > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); > - cluster_set_next(ci_tail, idx); > - spin_unlock(&ci_tail->lock); > - cluster_set_next_flag(&list->tail, idx, 0); > - } > -} > - > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, > - struct swap_cluster_info *ci) > -{ > - unsigned int idx; > - > - idx = cluster_next(&list->head); > - if (cluster_next(&list->tail) == idx) { > - cluster_set_null(&list->head); > - cluster_set_null(&list->tail); > - } else > - cluster_set_next_flag(&list->head, > - cluster_next(&ci[idx]), 0); > - > - return idx; > -} > - > /* Add a cluster to discard list and schedule it to do discard */ > static void swap_cluster_schedule_discard(struct swap_info_struct *si, > - unsigned int idx) > + struct swap_cluster_info *ci) > { > + unsigned int idx = ci - si->cluster_info; > /* > * If scan_swap_map_slots() can't find a free cluster, it will check > * si->swap_map directly. To make sure the discarding cluster isn't > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > SWAP_MAP_BAD, SWAPFILE_CLUSTER); > > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); > - > + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING); If we don't use ci->lock to protect ci->next, we don't need spin_lock here. 
> + list_add_tail(&ci->next, &si->discard_clusters); > + spin_unlock(&ci->lock); > schedule_work(&si->discard_work); > } > > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > { > - struct swap_cluster_info *ci = si->cluster_info; > - > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); > - cluster_list_add_tail(&si->free_clusters, ci, idx); > + ci->flags = CLUSTER_FLAG_FREE; > + list_add_tail(&ci->next, &si->free_clusters); > } > > /* > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > */ > static void swap_do_scheduled_discard(struct swap_info_struct *si) > { > - struct swap_cluster_info *info, *ci; > + struct swap_cluster_info *ci; > unsigned int idx; > > - info = si->cluster_info; > - > - while (!cluster_list_empty(&si->discard_clusters)) { > - idx = cluster_list_del_first(&si->discard_clusters, info); > + while (!list_empty(&si->discard_clusters)) { > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next); > + idx = ci - si->cluster_info; > spin_unlock(&si->lock); > > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, > SWAPFILE_CLUSTER); > > spin_lock(&si->lock); > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); > - __free_cluster(si, idx); > + > + spin_lock(&ci->lock); > + __free_cluster(si, ci); > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > 0, SWAPFILE_CLUSTER); > unlock_cluster(ci); > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref) > complete(&si->comp); > } > > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) > { > - struct swap_cluster_info *ci = si->cluster_info; > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); > - cluster_list_del_first(&si->free_clusters, ci); > - cluster_set_count_flag(ci + idx, 0, 0); > + VM_BUG_ON(ci - si->cluster_info != idx); > + list_del(&ci->next); > + ci->count = 0; > + ci->flags = 0; > + return ci; > } > > -static void free_cluster(struct swap_info_struct *si, unsigned long idx) > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > { > - struct swap_cluster_info *ci = si->cluster_info + idx; > - > - VM_BUG_ON(cluster_count(ci) != 0); > + VM_BUG_ON(ci->count != 0); > /* > * If the swap is discardable, prepare discard the cluster > * instead of free it immediately. 
The cluster will be freed > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) > */ > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == > (SWP_WRITEOK | SWP_PAGE_DISCARD)) { > - swap_cluster_schedule_discard(si, idx); > + swap_cluster_schedule_discard(si, ci); > return; > } > > - __free_cluster(si, idx); > + __free_cluster(si, ci); > } > > /* > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, > unsigned long count) > { > unsigned long idx = page_nr / SWAPFILE_CLUSTER; > + struct swap_cluster_info *ci = cluster_info + idx; > > if (!cluster_info) > return; > - if (cluster_is_free(&cluster_info[idx])) > + if (cluster_is_free(ci)) > alloc_cluster(p, idx); > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); > - cluster_set_count(&cluster_info[idx], > - cluster_count(&cluster_info[idx]) + count); > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); > + ci->count += count; > } > > /* > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, > } > > /* > - * The cluster corresponding to page_nr decreases one usage. If the usage > - * counter becomes 0, which means no page in the cluster is in using, we can > - * optionally discard the cluster and add it to free cluster list. > + * The cluster ci decreases one usage. If the usage counter becomes 0, > + * which means no page in the cluster is in using, we can optionally discard > + * the cluster and add it to free cluster list. > */ > -static void dec_cluster_info_page(struct swap_info_struct *p, > - struct swap_cluster_info *cluster_info, unsigned long page_nr) > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) > { > - unsigned long idx = page_nr / SWAPFILE_CLUSTER; > - > - if (!cluster_info) > + if (!p->cluster_info) > return; > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); > - cluster_set_count(&cluster_info[idx], > - cluster_count(&cluster_info[idx]) - 1); > + VM_BUG_ON(ci->count == 0); > + ci->count--; > > - if (cluster_count(&cluster_info[idx]) == 0) > - free_cluster(p, idx); > + if (!ci->count) > + free_cluster(p, ci); > } > > /* > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > { > struct percpu_cluster *percpu_cluster; > bool conflict; > - > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > offset /= SWAPFILE_CLUSTER; > - conflict = !cluster_list_empty(&si->free_clusters) && > - offset != cluster_list_first(&si->free_clusters) && > + conflict = !list_empty(&si->free_clusters) && > + offset != first - si->cluster_info && > cluster_is_free(&si->cluster_info[offset]); > > if (!conflict) > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > cluster = this_cpu_ptr(si->percpu_cluster); > tmp = cluster->next[order]; > if (tmp == SWAP_NEXT_INVALID) { > - if (!cluster_list_empty(&si->free_clusters)) { > - tmp = cluster_next(&si->free_clusters.head) * > - SWAPFILE_CLUSTER; > - } else if (!cluster_list_empty(&si->discard_clusters)) { > + if (!list_empty(&si->free_clusters)) { > + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > + list_del(&ci->next); > + spin_lock(&ci->lock); > + ci->flags = 0; > + spin_unlock(&ci->lock); > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > + } else if (!list_empty(&si->discard_clusters)) { > /* > * we don't have free cluster but have some clusters in > * 
discarding, do discard now and reclaim them, then > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > goto new_cluster; > } else > return false; > - } > + } else > + ci = si->cluster_info + tmp; > > /* > * Other CPUs can use our cluster if they can't find a free cluster, > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > > ci = lock_cluster(si, offset); > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > - cluster_set_count_flag(ci, 0, 0); > - free_cluster(si, idx); > + ci->count = 0; > + ci->flags = 0; > + free_cluster(si, ci); > unlock_cluster(ci); > swap_range_free(si, offset, SWAPFILE_CLUSTER); > } > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > count = p->swap_map[offset]; > VM_BUG_ON(count != SWAP_HAS_CACHE); > p->swap_map[offset] = 0; > - dec_cluster_info_page(p, p->cluster_info, offset); > + dec_cluster_info_page(p, ci); > unlock_cluster(ci); > > mem_cgroup_uncharge_swap(entry, 1); > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > nr_good_pages = maxpages - 1; /* omit header page */ > > - cluster_list_init(&p->free_clusters); > - cluster_list_init(&p->discard_clusters); > + INIT_LIST_HEAD(&p->free_clusters); > + INIT_LIST_HEAD(&p->discard_clusters); > > for (i = 0; i < swap_header->info.nr_badpages; i++) { > unsigned int page_nr = swap_header->info.badpages[i]; > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > for (k = 0; k < SWAP_CLUSTER_COLS; k++) { > j = (k + col) % SWAP_CLUSTER_COLS; > for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { > + struct swap_cluster_info *ci; > idx = i * SWAP_CLUSTER_COLS + j; > + ci = cluster_info + idx; > if (idx >= nr_clusters) > continue; > - if (cluster_count(&cluster_info[idx])) > + if (ci->count) > continue; > - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); > - cluster_list_add_tail(&p->free_clusters, cluster_info, > - idx); > + ci->flags = CLUSTER_FLAG_FREE; > + list_add_tail(&ci->next, &p->free_clusters); > } > } > return nr_extents; -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
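To make the memory overhead estimate in the review above concrete, here is a quick standalone calculation, not kernel code. It assumes 4KiB pages and 512 swap entries per cluster (i.e. 2MiB of swap per cluster) and the 8-byte vs. 24-byte per-cluster sizes quoted above.

#include <stdio.h>

int main(void)
{
	const unsigned long long swap_size = 1ULL << 40;		/* 1 TiB device */
	const unsigned long long cluster_bytes = 512ULL * 4096;	/* 2 MiB per cluster */
	const unsigned long long nr_clusters = swap_size / cluster_bytes;

	/* 8 bytes per cluster before the list conversion, 24 bytes after. */
	printf("%llu clusters: %llu MiB before, %llu MiB after\n",
	       nr_clusters,
	       nr_clusters * 8 >> 20,
	       nr_clusters * 24 >> 20);	/* 524288 clusters: 4 MiB -> 12 MiB */
	return 0;
}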
* Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-29 8:46 ` Huang, Ying @ 2024-05-30 21:49 ` Chris Li 2024-05-31 2:03 ` Huang, Ying 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-05-30 21:49 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Wed, May 29, 2024 at 1:48 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > Previously, the swap cluster used a cluster index as a pointer > > to construct a custom single link list type "swap_cluster_list". > > The next cluster pointer is shared with the cluster->count. > > The assumption is that only the free cluster needs to be put > > on the list. > > > > That assumption is not true for mTHP allocators any more. > > I think that the words aren't correct here. mTHP swap entry allocators > can work with current cluster definition. The current behavior is very problematic though: If we only allocate and free order 4 swap entries, nothing else. After a while, the free cluster will be used up, the swap entry allocation will fail even though there is a lot of swap space left. > > Need to track the non full cluster on the list as well. Move the > > current cluster single link list into standard double link list. > > It's an optimization to track non-full cluster with a list. > > I understand that you want to change cluster list definition. I just In my mind, I was changing the list implementation so it can be tracked non free cluster as well. > feel the wording isn't accurate. Help me improve it. I am happy to adjust the wording in V2, you can provide more feedback then. > > > Remove the cluster getter/setter for accessing the cluster > > struct member. Move the cluster locking in the caller function > > rather than the getter/setter function. That way the locking can > > protect more than one member, e.g. cluster->flag. > > Sorry, I don't understand the locking in above words. I don't find that > we lock/unlock in the original getter/setter functions. I found that > the cluster locking rule for cluster list is changed. Better to make > this explicit. The original cluster single link list add/remove will require si->lock protection as well, because the list head and tail pointer are outside of the cluster pointer. In this regard, the cluster double link list locking rule is very similar. Yes, I move the list_del() outside of the cluster lock, is that what you are referring to as the locking change? > > Change cluster code to use "struct swap_cluster_info *" to > > reference the cluster rather than by using index. That is more > > consistent with the list manipulation. It avoids the repeat > > adding index to the cluser_info. The code is easier to understand. > > > > Remove the cluster next pointer is NULL flag, the double link > > list can handle the empty list pretty well. > > > > The "swap_cluster_info" struct is two pointer bigger, because > > 512 swap entries share one swap struct, it has very little impact > > on the average memory usage per swap entry. Other than the list > > conversion, there is no real function change in this patch. > > On 64bit platform, the size of swap_cluster_info increases from 8 bytes > to 24 bytes. For a 1TB swap device, the memory usage will increase from > 4MB to 12MB. This looks OK for me. Will add the size change calculation in V2 and have you review it again. > > Another choice is to use a customized double linked list using "unsigned > int" as pointer to cluster. 
That will reduce the size of cluster to 16 > bytes. But it may be not necessary to do that. We can always do that as a follow up step to optimize the 24 byte to 16 byte, at the price of more code complicity. The trick part is the link list head, it is not part of the cluster array, it does not have an index, and will need a special handle for that. > > Anyway, I think that it's better to add more calculation in changelog > for memory usage increment. Sure, I will adjust the commit message in V2. Chris > > > --- > > include/linux/swap.h | 14 ++-- > > mm/swapfile.c | 231 ++++++++++++++------------------------------------- > > 2 files changed, 68 insertions(+), 177 deletions(-) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 11c53692f65f..0d3906eff3c9 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -254,11 +254,12 @@ struct swap_cluster_info { > > * elements correspond to the swap > > * cluster > > */ > > - unsigned int data:24; > > + unsigned int count:16; > > unsigned int flags:8; > > If we use 16bits and 8 bits in bit fields, why not just use u8 and u16 > instead? Not sure about the > > > + struct list_head next; > > "next" isn't a good naming because prev pointer is in list_head too. > The common naming is "list". Sure, I can change it to "list". > > Need to revise comments for swap_cluster_info.lock and add the locking > rule comments for swap_cluster_info.next. Will do. > > > }; > > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ > > + > > > > /* > > * The first page in the swap file is the swap header, which is always marked > > @@ -283,11 +284,6 @@ struct percpu_cluster { > > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ > > }; > > > > -struct swap_cluster_list { > > - struct swap_cluster_info head; > > - struct swap_cluster_info tail; > > -}; > > - > > /* > > * The in-memory structure used to track swap areas. > > */ > > @@ -300,7 +296,7 @@ struct swap_info_struct { > > unsigned int max; /* extent of the swap_map */ > > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > > - struct swap_cluster_list free_clusters; /* free clusters list */ > > + struct list_head free_clusters; /* free clusters list */ > > unsigned int lowest_bit; /* index of first free in swap_map */ > > unsigned int highest_bit; /* index of last free in swap_map */ > > unsigned int pages; /* total of usable pages of swap */ > > @@ -333,7 +329,7 @@ struct swap_info_struct { > > * list. > > */ > > struct work_struct discard_work; /* discard worker */ > > - struct swap_cluster_list discard_clusters; /* discard clusters list */ > > + struct list_head discard_clusters; /* discard clusters list */ > > struct plist_node avail_lists[]; /* > > * entries in swap_avail_heads, one > > * entry per node. 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 4f0e8b2ac8aa..205a60c5f9cb 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si, > > #endif > > #define LATENCY_LIMIT 256 > > > > -static inline void cluster_set_flag(struct swap_cluster_info *info, > > - unsigned int flag) > > -{ > > - info->flags = flag; > > -} > > - > > -static inline unsigned int cluster_count(struct swap_cluster_info *info) > > -{ > > - return info->data; > > -} > > - > > -static inline void cluster_set_count(struct swap_cluster_info *info, > > - unsigned int c) > > -{ > > - info->data = c; > > -} > > - > > -static inline void cluster_set_count_flag(struct swap_cluster_info *info, > > - unsigned int c, unsigned int f) > > -{ > > - info->flags = f; > > - info->data = c; > > -} > > - > > -static inline unsigned int cluster_next(struct swap_cluster_info *info) > > -{ > > - return info->data; > > -} > > - > > -static inline void cluster_set_next(struct swap_cluster_info *info, > > - unsigned int n) > > -{ > > - info->data = n; > > -} > > - > > -static inline void cluster_set_next_flag(struct swap_cluster_info *info, > > - unsigned int n, unsigned int f) > > -{ > > - info->flags = f; > > - info->data = n; > > -} > > - > > static inline bool cluster_is_free(struct swap_cluster_info *info) > > { > > return info->flags & CLUSTER_FLAG_FREE; > > } > > > > -static inline bool cluster_is_null(struct swap_cluster_info *info) > > -{ > > - return info->flags & CLUSTER_FLAG_NEXT_NULL; > > -} > > - > > -static inline void cluster_set_null(struct swap_cluster_info *info) > > -{ > > - info->flags = CLUSTER_FLAG_NEXT_NULL; > > - info->data = 0; > > -} > > - > > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, > > unsigned long offset) > > { > > @@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, > > spin_unlock(&si->lock); > > } > > > > -static inline bool cluster_list_empty(struct swap_cluster_list *list) > > -{ > > - return cluster_is_null(&list->head); > > -} > > - > > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list) > > -{ > > - return cluster_next(&list->head); > > -} > > - > > -static void cluster_list_init(struct swap_cluster_list *list) > > -{ > > - cluster_set_null(&list->head); > > - cluster_set_null(&list->tail); > > -} > > - > > -static void cluster_list_add_tail(struct swap_cluster_list *list, > > - struct swap_cluster_info *ci, > > - unsigned int idx) > > -{ > > - if (cluster_list_empty(list)) { > > - cluster_set_next_flag(&list->head, idx, 0); > > - cluster_set_next_flag(&list->tail, idx, 0); > > - } else { > > - struct swap_cluster_info *ci_tail; > > - unsigned int tail = cluster_next(&list->tail); > > - > > - /* > > - * Nested cluster lock, but both cluster locks are > > - * only acquired when we held swap_info_struct->lock > > - */ > > - ci_tail = ci + tail; > > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); > > - cluster_set_next(ci_tail, idx); > > - spin_unlock(&ci_tail->lock); > > - cluster_set_next_flag(&list->tail, idx, 0); > > - } > > -} > > - > > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, > > - struct swap_cluster_info *ci) > > -{ > > - unsigned int idx; > > - > > - idx = cluster_next(&list->head); > > - if (cluster_next(&list->tail) == idx) { > > - cluster_set_null(&list->head); > > - cluster_set_null(&list->tail); > > - } else > > - cluster_set_next_flag(&list->head, 
> > - cluster_next(&ci[idx]), 0); > > - > > - return idx; > > -} > > - > > /* Add a cluster to discard list and schedule it to do discard */ > > static void swap_cluster_schedule_discard(struct swap_info_struct *si, > > - unsigned int idx) > > + struct swap_cluster_info *ci) > > { > > + unsigned int idx = ci - si->cluster_info; > > /* > > * If scan_swap_map_slots() can't find a free cluster, it will check > > * si->swap_map directly. To make sure the discarding cluster isn't > > @@ -462,17 +355,16 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > > SWAP_MAP_BAD, SWAPFILE_CLUSTER); > > > > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); > > - > > + spin_lock_nested(&ci->lock, SINGLE_DEPTH_NESTING); > > If we don't use ci->lock to protect ci->next, we don't need spin_lock here. Good point. Thanks. > > > + list_add_tail(&ci->next, &si->discard_clusters); > > + spin_unlock(&ci->lock); > > schedule_work(&si->discard_work); > > } > > > > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > > { > > - struct swap_cluster_info *ci = si->cluster_info; > > - > > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); > > - cluster_list_add_tail(&si->free_clusters, ci, idx); > > + ci->flags = CLUSTER_FLAG_FREE; > > + list_add_tail(&ci->next, &si->free_clusters); > > } > > > > /* > > @@ -481,21 +373,21 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) > > */ > > static void swap_do_scheduled_discard(struct swap_info_struct *si) > > { > > - struct swap_cluster_info *info, *ci; > > + struct swap_cluster_info *ci; > > unsigned int idx; > > > > - info = si->cluster_info; > > - > > - while (!cluster_list_empty(&si->discard_clusters)) { > > - idx = cluster_list_del_first(&si->discard_clusters, info); > > + while (!list_empty(&si->discard_clusters)) { > > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, next); > > + idx = ci - si->cluster_info; > > spin_unlock(&si->lock); > > > > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, > > SWAPFILE_CLUSTER); > > > > spin_lock(&si->lock); > > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); > > - __free_cluster(si, idx); > > + > > + spin_lock(&ci->lock); > > + __free_cluster(si, ci); > > memset(si->swap_map + idx * SWAPFILE_CLUSTER, > > 0, SWAPFILE_CLUSTER); > > unlock_cluster(ci); > > @@ -521,20 +413,20 @@ static void swap_users_ref_free(struct percpu_ref *ref) > > complete(&si->comp); > > } > > > > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) > > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) > > { > > - struct swap_cluster_info *ci = si->cluster_info; > > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > > > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); > > - cluster_list_del_first(&si->free_clusters, ci); > > - cluster_set_count_flag(ci + idx, 0, 0); > > + VM_BUG_ON(ci - si->cluster_info != idx); > > + list_del(&ci->next); > > + ci->count = 0; > > + ci->flags = 0; > > + return ci; > > } > > > > -static void free_cluster(struct swap_info_struct *si, unsigned long idx) > > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > > { > > - struct swap_cluster_info *ci = si->cluster_info + idx; > > - > > - VM_BUG_ON(cluster_count(ci) != 
0); > > + VM_BUG_ON(ci->count != 0); > > /* > > * If the swap is discardable, prepare discard the cluster > > * instead of free it immediately. The cluster will be freed > > @@ -542,11 +434,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) > > */ > > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == > > (SWP_WRITEOK | SWP_PAGE_DISCARD)) { > > - swap_cluster_schedule_discard(si, idx); > > + swap_cluster_schedule_discard(si, ci); > > return; > > } > > > > - __free_cluster(si, idx); > > + __free_cluster(si, ci); > > } > > > > /* > > @@ -559,15 +451,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, > > unsigned long count) > > { > > unsigned long idx = page_nr / SWAPFILE_CLUSTER; > > + struct swap_cluster_info *ci = cluster_info + idx; > > > > if (!cluster_info) > > return; > > - if (cluster_is_free(&cluster_info[idx])) > > + if (cluster_is_free(ci)) > > alloc_cluster(p, idx); > > > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); > > - cluster_set_count(&cluster_info[idx], > > - cluster_count(&cluster_info[idx]) + count); > > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); > > + ci->count += count; > > } > > > > /* > > @@ -581,24 +473,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, > > } > > > > /* > > - * The cluster corresponding to page_nr decreases one usage. If the usage > > - * counter becomes 0, which means no page in the cluster is in using, we can > > - * optionally discard the cluster and add it to free cluster list. > > + * The cluster ci decreases one usage. If the usage counter becomes 0, > > + * which means no page in the cluster is in using, we can optionally discard > > + * the cluster and add it to free cluster list. > > */ > > -static void dec_cluster_info_page(struct swap_info_struct *p, > > - struct swap_cluster_info *cluster_info, unsigned long page_nr) > > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) > > { > > - unsigned long idx = page_nr / SWAPFILE_CLUSTER; > > - > > - if (!cluster_info) > > + if (!p->cluster_info) > > return; > > > > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); > > - cluster_set_count(&cluster_info[idx], > > - cluster_count(&cluster_info[idx]) - 1); > > + VM_BUG_ON(ci->count == 0); > > + ci->count--; > > > > - if (cluster_count(&cluster_info[idx]) == 0) > > - free_cluster(p, idx); > > + if (!ci->count) > > + free_cluster(p, ci); > > } > > > > /* > > @@ -611,10 +499,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, > > { > > struct percpu_cluster *percpu_cluster; > > bool conflict; > > - > > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > offset /= SWAPFILE_CLUSTER; > > - conflict = !cluster_list_empty(&si->free_clusters) && > > - offset != cluster_list_first(&si->free_clusters) && > > + conflict = !list_empty(&si->free_clusters) && > > + offset != first - si->cluster_info && > > cluster_is_free(&si->cluster_info[offset]); > > > > if (!conflict) > > @@ -655,10 +543,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > cluster = this_cpu_ptr(si->percpu_cluster); > > tmp = cluster->next[order]; > > if (tmp == SWAP_NEXT_INVALID) { > > - if (!cluster_list_empty(&si->free_clusters)) { > > - tmp = cluster_next(&si->free_clusters.head) * > > - SWAPFILE_CLUSTER; > > - } else if (!cluster_list_empty(&si->discard_clusters)) { > > + if (!list_empty(&si->free_clusters)) { > > + ci = 
list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > + list_del(&ci->next); > > + spin_lock(&ci->lock); > > + ci->flags = 0; > > + spin_unlock(&ci->lock); > > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > > + } else if (!list_empty(&si->discard_clusters)) { > > /* > > * we don't have free cluster but have some clusters in > > * discarding, do discard now and reclaim them, then > > @@ -670,7 +562,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > goto new_cluster; > > } else > > return false; > > - } > > + } else > > + ci = si->cluster_info + tmp; > > > > /* > > * Other CPUs can use our cluster if they can't find a free cluster, > > @@ -1062,8 +955,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > > > > ci = lock_cluster(si, offset); > > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > > - cluster_set_count_flag(ci, 0, 0); > > - free_cluster(si, idx); > > + ci->count = 0; > > + ci->flags = 0; > > + free_cluster(si, ci); > > unlock_cluster(ci); > > swap_range_free(si, offset, SWAPFILE_CLUSTER); > > } > > @@ -1336,7 +1230,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) > > count = p->swap_map[offset]; > > VM_BUG_ON(count != SWAP_HAS_CACHE); > > p->swap_map[offset] = 0; > > - dec_cluster_info_page(p, p->cluster_info, offset); > > + dec_cluster_info_page(p, ci); > > unlock_cluster(ci); > > > > mem_cgroup_uncharge_swap(entry, 1); > > @@ -2985,8 +2879,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > > > nr_good_pages = maxpages - 1; /* omit header page */ > > > > - cluster_list_init(&p->free_clusters); > > - cluster_list_init(&p->discard_clusters); > > + INIT_LIST_HEAD(&p->free_clusters); > > + INIT_LIST_HEAD(&p->discard_clusters); > > > > for (i = 0; i < swap_header->info.nr_badpages; i++) { > > unsigned int page_nr = swap_header->info.badpages[i]; > > @@ -3037,14 +2931,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > for (k = 0; k < SWAP_CLUSTER_COLS; k++) { > > j = (k + col) % SWAP_CLUSTER_COLS; > > for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { > > + struct swap_cluster_info *ci; > > idx = i * SWAP_CLUSTER_COLS + j; > > + ci = cluster_info + idx; > > if (idx >= nr_clusters) > > continue; > > - if (cluster_count(&cluster_info[idx])) > > + if (ci->count) > > continue; > > - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); > > - cluster_list_add_tail(&p->free_clusters, cluster_info, > > - idx); > > + ci->flags = CLUSTER_FLAG_FREE; > > + list_add_tail(&ci->next, &p->free_clusters); > > } > > } > > return nr_extents; > > -- > Best Regards, > Huang, Ying > ^ permalink raw reply [flat|nested] 43+ messages in thread
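The failure mode Chris describes (plenty of free space, yet mTHP allocations fail) can be modelled in a few lines of ordinary C. This is a toy userspace sketch with arbitrary sizes, not kernel code: under the old policy an order-4 allocation may only come from a completely empty cluster, so once every cluster has even a handful of entries in use the allocation fails despite tens of thousands of free slots.

#include <stdio.h>

#define NR_CLUSTERS 64
#define SLOTS_PER_CLUSTER 512
#define ORDER4_SLOTS 16		/* an order-4 mTHP needs 16 contiguous slots */

static int used[NR_CLUSTERS];	/* slots in use per cluster */

/* Old policy: an order-4 allocation is only taken from an empty cluster. */
static int alloc_order4_old(void)
{
	for (int i = 0; i < NR_CLUSTERS; i++) {
		if (used[i] == 0) {	/* cluster must be 100% free */
			used[i] = ORDER4_SLOTS;
			return i;
		}
	}
	return -1;			/* allocation fails */
}

int main(void)
{
	long free_slots = 0;

	/* Leave a small number of entries in use in every cluster, as a
	 * long-running workload would after partial frees. */
	for (int i = 0; i < NR_CLUSTERS; i++)
		used[i] = ORDER4_SLOTS;

	for (int i = 0; i < NR_CLUSTERS; i++)
		free_slots += SLOTS_PER_CLUSTER - used[i];

	printf("free slots: %ld, order-4 allocation: %d\n",
	       free_slots, alloc_order4_old());	/* 31744 free, yet -1 */
	return 0;
}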
* Re: [PATCH 1/2] mm: swap: swap cluster switch to double link list 2024-05-30 21:49 ` Chris Li @ 2024-05-31 2:03 ` Huang, Ying 0 siblings, 0 replies; 43+ messages in thread From: Huang, Ying @ 2024-05-31 2:03 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Wed, May 29, 2024 at 1:48 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > Previously, the swap cluster used a cluster index as a pointer >> > to construct a custom single link list type "swap_cluster_list". >> > The next cluster pointer is shared with the cluster->count. >> > The assumption is that only the free cluster needs to be put >> > on the list. >> > >> > That assumption is not true for mTHP allocators any more. >> >> I think that the words aren't correct here. mTHP swap entry allocators >> can work with current cluster definition. > > The current behavior is very problematic though: > > If we only allocate and free order 4 swap entries, nothing else. After > a while, the free cluster will be used up, the swap entry allocation > will fail even though there is a lot of swap space left. The original behavior doesn't work well for order-0 allocation too. percpu_cluster and quick (cluster) scan path cannot be used for fragmented swap devices. >> > Need to track the non full cluster on the list as well. Move the >> > current cluster single link list into standard double link list. >> >> It's an optimization to track non-full cluster with a list. >> >> I understand that you want to change cluster list definition. I just > > In my mind, I was changing the list implementation so it can be > tracked non free cluster as well. > >> feel the wording isn't accurate. > > Help me improve it. I am happy to adjust the wording in V2, you can > provide more feedback then. I suggest you to focus on improvement. The original implementation hasn't the assumption that it's the best or perfect. It improves from its base and you can continue to improve it for more situations. Describe the situation where current implementation doesn't performance well and how do you improve it. Better with the cost. >> >> > Remove the cluster getter/setter for accessing the cluster >> > struct member. Move the cluster locking in the caller function >> > rather than the getter/setter function. That way the locking can >> > protect more than one member, e.g. cluster->flag. >> >> Sorry, I don't understand the locking in above words. I don't find that >> we lock/unlock in the original getter/setter functions. I found that >> the cluster locking rule for cluster list is changed. Better to make >> this explicit. > > The original cluster single link list add/remove will require si->lock > protection as well, because the list head and tail pointer are outside > of the cluster pointer. > In this regard, the cluster double link list locking rule is very > similar. Yes, I move the list_del() outside of the cluster lock, is > that what you are referring to as the locking change? In the original implementation, ci->lock is held when changing ci->data (in fact next here). Now, you don't need the ci->lock. This is a locking rule change, I suggest you to make it explicit in change log and comments. >> > Change cluster code to use "struct swap_cluster_info *" to >> > reference the cluster rather than by using index. That is more >> > consistent with the list manipulation. It avoids the repeat >> > adding index to the cluser_info. 
The code is easier to understand. >> > >> > Remove the cluster next pointer is NULL flag, the double link >> > list can handle the empty list pretty well. >> > >> > The "swap_cluster_info" struct is two pointer bigger, because >> > 512 swap entries share one swap struct, it has very little impact >> > on the average memory usage per swap entry. Other than the list >> > conversion, there is no real function change in this patch. >> >> On 64bit platform, the size of swap_cluster_info increases from 8 bytes >> to 24 bytes. For a 1TB swap device, the memory usage will increase from >> 4MB to 12MB. This looks OK for me. > > Will add the size change calculation in V2 and have you review it again. > >> >> Another choice is to use a customized double linked list using "unsigned >> int" as pointer to cluster. That will reduce the size of cluster to 16 >> bytes. But it may be not necessary to do that. > > We can always do that as a follow up step to optimize the 24 byte to > 16 byte, at the price of more code complicity. > The trick part is the link list head, it is not part of the cluster > array, it does not have an index, and will need a special handle for > that. In theory, you can define a "struct list_u32_head" and a set of list_u32_* functions. But I don't find that it's necessary to do that. >> >> Anyway, I think that it's better to add more calculation in changelog >> for memory usage increment. > > Sure, I will adjust the commit message in V2. > > Chris > >> >> > --- >> > include/linux/swap.h | 14 ++-- >> > mm/swapfile.c | 231 ++++++++++++++------------------------------------- >> > 2 files changed, 68 insertions(+), 177 deletions(-) >> > >> > diff --git a/include/linux/swap.h b/include/linux/swap.h >> > index 11c53692f65f..0d3906eff3c9 100644 >> > --- a/include/linux/swap.h >> > +++ b/include/linux/swap.h >> > @@ -254,11 +254,12 @@ struct swap_cluster_info { >> > * elements correspond to the swap >> > * cluster >> > */ >> > - unsigned int data:24; >> > + unsigned int count:16; >> > unsigned int flags:8; >> >> If we use 16bits and 8 bits in bit fields, why not just use u8 and u16 >> instead? > Not sure about the ? u16 count; u8 flags; >> >> > + struct list_head next; >> >> "next" isn't a good naming because prev pointer is in list_head too. >> The common naming is "list". > > Sure, I can change it to "list". > >> >> Need to revise comments for swap_cluster_info.lock and add the locking >> rule comments for swap_cluster_info.next. > > Will do. > >> >> > }; >> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ >> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ >> > + >> > [snip] -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
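For reference, the "u32 pointer" list Ying mentions could look roughly like the userspace sketch below. The names (u32_list, u32_list_add_tail, ...) are hypothetical, modelled on the suggested list_u32_* naming; nothing like this exists in the kernel today. Links are 4-byte array indices rather than 8-byte pointers, so the per-cluster link state would be 8 bytes instead of 16. The sketch also makes visible why unlinking a node writes to its prev and next neighbours, which is the three-cluster locking concern raised earlier in the thread.

#include <stdio.h>

#define NIL 0xffffffffu

struct u32_node {
	unsigned int prev, next;	/* indices into a cluster array, or NIL */
};

struct u32_list {
	unsigned int head, tail;
};

static struct u32_node nodes[8];

static void u32_list_init(struct u32_list *l)
{
	l->head = l->tail = NIL;
}

static void u32_list_add_tail(struct u32_list *l, unsigned int idx)
{
	nodes[idx].next = NIL;
	nodes[idx].prev = l->tail;
	if (l->tail == NIL)
		l->head = idx;
	else
		nodes[l->tail].next = idx;
	l->tail = idx;
}

static void u32_list_del(struct u32_list *l, unsigned int idx)
{
	unsigned int p = nodes[idx].prev, n = nodes[idx].next;

	/* Deleting idx touches both of its neighbours. */
	if (p == NIL)
		l->head = n;
	else
		nodes[p].next = n;
	if (n == NIL)
		l->tail = p;
	else
		nodes[n].prev = p;
}

int main(void)
{
	struct u32_list free_list;

	u32_list_init(&free_list);
	for (unsigned int i = 0; i < 8; i++)
		u32_list_add_tail(&free_list, i);
	u32_list_del(&free_list, 3);

	printf("head=%u tail=%u, neighbours of the removed node: %u <-> %u\n",
	       free_list.head, free_list.tail, nodes[2].next, nodes[4].prev);
	return 0;
}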
* [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-05-24 17:17 [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li 2024-05-24 17:17 ` [PATCH 1/2] mm: swap: swap cluster switch to double link list Chris Li @ 2024-05-24 17:17 ` Chris Li 2024-06-07 10:35 ` Ryan Roberts 2024-05-28 3:07 ` [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Barry Song ` (2 subsequent siblings) 4 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-05-24 17:17 UTC (permalink / raw) To: Andrew Morton Cc: Kairui Song, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm, Chris Li, Barry Song Track the nonfull clusters as well as the empty clusters on lists. Each order has one nonfull cluster list. The cluster will remember which order it was used for during new cluster allocation. When a cluster has a free entry, add it to the nonfull[order] list. When the free cluster list is empty, also allocate from the nonfull list of that order. This improves the mTHP swap allocation success rate. There are limitations if the distribution of the numbers of different orders of mTHP changes a lot, e.g. a lot of nonfull clusters are assigned to order A while later there are a lot of order B allocations and very few order A allocations. Currently the clusters used by order A will not be reused by order B unless the cluster is 100% empty. This situation is best addressed by the longer-term "swap buddy allocator", in future patches. --- include/linux/swap.h | 4 ++++ mm/swapfile.c | 25 +++++++++++++++++++++++-- 2 files changed, 27 insertions(+), 2 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 0d3906eff3c9..1b7f0794b9bf 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -255,10 +255,12 @@ struct swap_cluster_info { * cluster */ unsigned int count:16; + unsigned int order:8; unsigned int flags:8; struct list_head next; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ /* @@ -297,6 +299,8 @@ struct swap_info_struct { unsigned char *swap_map; /* vmalloc'ed array of usage counts */ struct swap_cluster_info *cluster_info; /* cluster info.
Only for SSD */ struct list_head free_clusters; /* free clusters list */ + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; + /* list of cluster that contains at least one free slot */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 205a60c5f9cb..51923aba500e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { + if (ci->flags & CLUSTER_FLAG_NONFULL) + list_move_tail(&ci->next, &si->free_clusters); + else + list_add_tail(&ci->next, &si->free_clusters); ci->flags = CLUSTER_FLAG_FREE; - list_add_tail(&ci->next, &si->free_clusters); } /* @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste ci->count--; if (!ci->count) - free_cluster(p, ci); + return free_cluster(p, ci); + + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); + ci->flags |= CLUSTER_FLAG_NONFULL; + } } /* @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); list_del(&ci->next); spin_lock(&ci->lock); + ci->order = order; + ci->flags = 0; + spin_unlock(&ci->lock); + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->nonfull_clusters[order])) { + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); + list_del(&ci->next); + spin_lock(&ci->lock); ci->flags = 0; spin_unlock(&ci->lock); tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, break; tmp += nr_pages; } + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order); unlock_cluster(ci); } if (tmp >= max) { @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count = 0; + ci->order = 0; ci->flags = 0; free_cluster(si, ci); unlock_cluster(ci); @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, INIT_LIST_HEAD(&p->free_clusters); INIT_LIST_HEAD(&p->discard_clusters); + for (i = 0; i < SWAP_NR_ORDERS; i++) + INIT_LIST_HEAD(&p->nonfull_clusters[i]); + for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; if (page_nr == 0 || page_nr > swap_header->info.last_page) -- 2.45.1.288.g0e0cd299f1-goog ^ permalink raw reply [flat|nested] 43+ messages in thread
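To summarise the allocation policy this patch introduces, here is a deliberately simplified userspace model: plain arrays stand in for the kernel's lists, locks and per-cpu state, and all sizes are arbitrary. Each cluster is tagged with the order it was handed out for, a partially freed cluster is queued on a per-order nonfull list, and that list is tried once the empty clusters are exhausted.

#include <stdio.h>

#define NR_ORDERS 5
#define NR_CLUSTERS 8
#define SLOTS_PER_CLUSTER 512

struct cluster {
	int free_slots;		/* entries still free in this cluster */
	int order;		/* order this cluster was handed out for */
	int on_nonfull;		/* set while queued on nonfull[order] */
};

static struct cluster clusters[NR_CLUSTERS];
static int nonfull[NR_ORDERS][NR_CLUSTERS];	/* simple stacks of cluster ids */
static int nonfull_top[NR_ORDERS];

static int pick_cluster(int order)
{
	/* 1) Prefer a completely free cluster and claim it for this order. */
	for (int i = 0; i < NR_CLUSTERS; i++) {
		if (clusters[i].free_slots == SLOTS_PER_CLUSTER) {
			clusters[i].order = order;
			return i;
		}
	}
	/* 2) Otherwise fall back to a nonfull cluster of the same order. */
	if (nonfull_top[order] > 0) {
		int i = nonfull[order][--nonfull_top[order]];

		clusters[i].on_nonfull = 0;
		return i;
	}
	return -1;
}

static void free_one_entry(int i)
{
	clusters[i].free_slots++;
	if (!clusters[i].on_nonfull) {
		nonfull[clusters[i].order][nonfull_top[clusters[i].order]++] = i;
		clusters[i].on_nonfull = 1;
	}
}

int main(void)
{
	for (int i = 0; i < NR_CLUSTERS; i++)
		clusters[i].free_slots = SLOTS_PER_CLUSTER;

	/* Exhaust every empty cluster with order-4 allocations ... */
	for (int n = 0; n < NR_CLUSTERS; n++)
		clusters[pick_cluster(4)].free_slots = 0;

	/* ... then free 16 entries in one cluster: it becomes usable again
	 * for order 4 even though it is not 100% empty. */
	for (int k = 0; k < 16; k++)
		free_one_entry(5);

	printf("order-4 allocation now picks cluster %d\n", pick_cluster(4));
	return 0;
}

The real patch keeps the cluster queued via CLUSTER_FLAG_NONFULL and still goes through the discard handling and scan paths; none of that is modelled here.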
* Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-05-24 17:17 ` [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li @ 2024-06-07 10:35 ` Ryan Roberts 2024-06-07 10:57 ` Ryan Roberts 2024-06-07 20:52 ` Chris Li 0 siblings, 2 replies; 43+ messages in thread From: Ryan Roberts @ 2024-06-07 10:35 UTC (permalink / raw) To: Chris Li, Andrew Morton Cc: Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On 24/05/2024 18:17, Chris Li wrote: > Track the nonfull cluster as well as the empty cluster > on lists. Each order has one nonfull cluster list. > > The cluster will remember which order it was used during > new cluster allocation. > > When the cluster has free entry, add to the nonfull[order] > list. When the free cluster list is empty, also allocate > from the nonempty list of that order. > > This improves the mTHP swap allocation success rate. If I've understood correctly, the aim here is to link all the current per-cpu clusters for a given order together so that if a cpu can't allocate a new cluster for a given order, then it can steal another CPU's current cluster for that order? If that's the intent, couldn't that be done just by iterating over the per-cpu, per-order cluster pointers? Then you don't need all the linked list churn (although I like the linked list changes as a nice cleanup, I'm not sure the churn is necessary for this change?). There would likely need to be some locking considerations, but it would also allow you to get access to the next entry within the cluster for allocation. However, fundamentally, I don't think this change solves the problem; it just takes a bit longer before the allocation fails. The real problem is fragmentation due to freeing individual pages from swap entries at different times. Wouldn't it be better to just extend scanning to support high order allocations? Then we can steal a high order block from any cluster, even clusters that were previously full, just like we currently do for order-0. Given we are already falling back to this path for order-0, I don't think it would be any more expensive; in fact it's less expensive because we only scan once for the high order block, rather than scan for every split order-0 page. Of course that still doesn't solve the problem entirely; if swap is so fragmented that there is no contiguous block of the required order then you still have to fall back to splitting. As an extra optimization, you could store the largest contiguous free space available in each cluster to avoid scanning in case it's too small? > > There are limitations if the distribution of numbers of > different orders of mTHP changes a lot. e.g. there are a lot > of nonfull cluster assign to order A while later time there > are a lot of order B allocation while very little allocation > in order A. Currently the cluster used by order A will not > reused by order B unless the cluster is 100% empty. > > This situation is best addressed by the longer term "swap > buddy allocator", in future patches.
> --- > include/linux/swap.h | 4 ++++ > mm/swapfile.c | 25 +++++++++++++++++++++++-- > 2 files changed, 27 insertions(+), 2 deletions(-) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 0d3906eff3c9..1b7f0794b9bf 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -255,10 +255,12 @@ struct swap_cluster_info { > * cluster > */ > unsigned int count:16; > + unsigned int order:8; > unsigned int flags:8; > struct list_head next; > }; > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ > > > /* > @@ -297,6 +299,8 @@ struct swap_info_struct { > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > struct list_head free_clusters; /* free clusters list */ > + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; > + /* list of cluster that contains at least one free slot */ > unsigned int lowest_bit; /* index of first free in swap_map */ > unsigned int highest_bit; /* index of last free in swap_map */ > unsigned int pages; /* total of usable pages of swap */ > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 205a60c5f9cb..51923aba500e 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > > static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > { > + if (ci->flags & CLUSTER_FLAG_NONFULL) > + list_move_tail(&ci->next, &si->free_clusters); > + else > + list_add_tail(&ci->next, &si->free_clusters); > ci->flags = CLUSTER_FLAG_FREE; > - list_add_tail(&ci->next, &si->free_clusters); > } > > /* > @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste > ci->count--; > > if (!ci->count) > - free_cluster(p, ci); > + return free_cluster(p, ci); > + > + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { > + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); > + ci->flags |= CLUSTER_FLAG_NONFULL; > + } > } > > /* > @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > list_del(&ci->next); > spin_lock(&ci->lock); > + ci->order = order; > + ci->flags = 0; > + spin_unlock(&ci->lock); > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > + } else if (!list_empty(&si->nonfull_clusters[order])) { > + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); > + list_del(&ci->next); > + spin_lock(&ci->lock); > ci->flags = 0; > spin_unlock(&ci->lock); > tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; This looks wrong to me; if the cluster is on the nonfull list then it will have had some entries already allocated (by another cpu). So pointing tmp to the first block in the cluster will never yield a free block. The cpu from which you are stealing the cluster stores the next free block location in its per-cpu structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is a better approach than the nonfull list? Additionally, this cluster will be stored back to this cpu's current cluster at the bottom of the function. That may or may not be what you intended. 
> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > break; > tmp += nr_pages; > } > + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order); > unlock_cluster(ci); > } > if (tmp >= max) { > @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > ci = lock_cluster(si, offset); > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > ci->count = 0; > + ci->order = 0; > ci->flags = 0; > free_cluster(si, ci); > unlock_cluster(ci); > @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > INIT_LIST_HEAD(&p->free_clusters); > INIT_LIST_HEAD(&p->discard_clusters); > > + for (i = 0; i < SWAP_NR_ORDERS; i++) > + INIT_LIST_HEAD(&p->nonfull_clusters[i]); > + > for (i = 0; i < swap_header->info.nr_badpages; i++) { > unsigned int page_nr = swap_header->info.badpages[i]; > if (page_nr == 0 || page_nr > swap_header->info.last_page) > ^ permalink raw reply [flat|nested] 43+ messages in thread
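For comparison, the "extend scanning to high orders" idea suggested in the review above could look roughly like this standalone sketch. It is not the kernel's scan_swap_map_slots(): swap_map here is just a byte array, the fragmentation pattern is made up, and scan_for_order() is a hypothetical helper. The scan walks naturally aligned blocks of 1 << order entries and succeeds only when the whole block is free.

#include <stdio.h>
#include <string.h>

#define MAP_SIZE 4096

static unsigned char swap_map[MAP_SIZE];	/* 0 means the slot is free */

static long scan_for_order(int order)
{
	const long nr = 1L << order;

	/* Step by nr so any block found stays naturally aligned. */
	for (long base = 0; base + nr <= MAP_SIZE; base += nr) {
		long i;

		for (i = 0; i < nr; i++)
			if (swap_map[base + i])
				break;
		if (i == nr)
			return base;	/* found an aligned free run */
	}
	return -1;
}

int main(void)
{
	/* Fragment the map: every third slot in use ... */
	for (long i = 0; i < MAP_SIZE; i += 3)
		swap_map[i] = 1;
	/* ... except one aligned 16-slot window left completely free. */
	memset(swap_map + 1024, 0, 16);

	printf("order-0 block at %ld, order-4 block at %ld\n",
	       scan_for_order(0), scan_for_order(4));	/* 1 and 1024 */
	return 0;
}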
* Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-06-07 10:35 ` Ryan Roberts @ 2024-06-07 10:57 ` Ryan Roberts 2024-06-07 20:53 ` Chris Li 2024-06-07 20:52 ` Chris Li 1 sibling, 1 reply; 43+ messages in thread From: Ryan Roberts @ 2024-06-07 10:57 UTC (permalink / raw) To: Chris Li, Andrew Morton Cc: Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On 07/06/2024 11:35, Ryan Roberts wrote: > On 24/05/2024 18:17, Chris Li wrote: >> Track the nonfull cluster as well as the empty cluster >> on lists. Each order has one nonfull cluster list. >> >> The cluster will remember which order it was used during >> new cluster allocation. >> >> When the cluster has free entry, add to the nonfull[order] >> list. When the free cluster list is empty, also allocate >> from the nonempty list of that order. >> >> This improves the mTHP swap allocation success rate. > > If I've understood correctly, the aim here is to link all the current per-cpu > clusters for a given order together so that if a cpu can't allocate a new > cluster for a given order, then it can steal another CPU's current cluster for > that order? > > If that's the intent, couldn't that be done just by iterating over the per-cpu, > per-order cluster pointers? Then you don't need all the linked list churn > (althogh I like the linked list changes as a nice cleanup, I'm not sure the > churn is neccessary for this change?). There would likely need to be some > locking considerations, but it would also allow you to get access to the next > entry within the cluster for allocation. > > However, fundamentally, I don't think this change solves the problem; it just > takes a bit longer before the allocation fails. The real problem is > fragmentation due to freeing individual pages from swap entries at different times. > > Wouldn't it be better to just extend scanning to support high order allocations? > Then we can steal a high order block from any cluster, even clusters that were > previously full, just like we currently do for order-0. Given we are already > falling back to this path for order-0, I don't think it would be any more > expensive; infact its less expensive because we only scan once for the high > order block, rather than scan for every split order-0 page. > > Of course that still doesn't solve the proplem entirely; if swap is so > fragmented that there is no contiguous block of the required order then you > still have to fall back to splitting. As an extra optimization, you could store > the largest contiguous free space available in each cluster to avoid scanning in > case its too small? > > >> >> There are limitations if the distribution of numbers of >> different orders of mTHP changes a lot. e.g. there are a lot >> of nonfull cluster assign to order A while later time there >> are a lot of order B allocation while very little allocation >> in order A. Currently the cluster used by order A will not >> reused by order B unless the cluster is 100% empty. >> >> This situation is best addressed by the longer term "swap >> buddy allocator", in future patches. 
>> --- >> include/linux/swap.h | 4 ++++ >> mm/swapfile.c | 25 +++++++++++++++++++++++-- >> 2 files changed, 27 insertions(+), 2 deletions(-) >> >> diff --git a/include/linux/swap.h b/include/linux/swap.h >> index 0d3906eff3c9..1b7f0794b9bf 100644 >> --- a/include/linux/swap.h >> +++ b/include/linux/swap.h >> @@ -255,10 +255,12 @@ struct swap_cluster_info { >> * cluster >> */ >> unsigned int count:16; >> + unsigned int order:8; >> unsigned int flags:8; >> struct list_head next; >> }; >> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ >> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ >> >> >> /* >> @@ -297,6 +299,8 @@ struct swap_info_struct { >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ >> struct list_head free_clusters; /* free clusters list */ >> + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; >> + /* list of cluster that contains at least one free slot */ >> unsigned int lowest_bit; /* index of first free in swap_map */ >> unsigned int highest_bit; /* index of last free in swap_map */ >> unsigned int pages; /* total of usable pages of swap */ >> diff --git a/mm/swapfile.c b/mm/swapfile.c >> index 205a60c5f9cb..51923aba500e 100644 >> --- a/mm/swapfile.c >> +++ b/mm/swapfile.c >> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, >> >> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) >> { >> + if (ci->flags & CLUSTER_FLAG_NONFULL) >> + list_move_tail(&ci->next, &si->free_clusters); >> + else >> + list_add_tail(&ci->next, &si->free_clusters); >> ci->flags = CLUSTER_FLAG_FREE; >> - list_add_tail(&ci->next, &si->free_clusters); >> } >> >> /* >> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste >> ci->count--; >> >> if (!ci->count) >> - free_cluster(p, ci); >> + return free_cluster(p, ci); >> + >> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { >> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); >> + ci->flags |= CLUSTER_FLAG_NONFULL; >> + } >> } >> >> /* >> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); >> list_del(&ci->next); >> spin_lock(&ci->lock); >> + ci->order = order; >> + ci->flags = 0; >> + spin_unlock(&ci->lock); >> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; >> + } else if (!list_empty(&si->nonfull_clusters[order])) { >> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); >> + list_del(&ci->next); >> + spin_lock(&ci->lock); >> ci->flags = 0; >> spin_unlock(&ci->lock); >> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > > This looks wrong to me; if the cluster is on the nonfull list then it will have > had some entries already allocated (by another cpu). So pointing tmp to the > first block in the cluster will never yield a free block. The cpu from which you > are stealing the cluster stores the next free block location in its per-cpu > structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is > a better approach than the nonfull list? Ahh; of course the cluster scan below will move this along to a free block. > > Additionally, this cluster will be stored back to this cpu's current cluster at > the bottom of the function. That may or may not be what you intended. 
> >> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >> break; >> tmp += nr_pages; >> } >> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order); >> unlock_cluster(ci); >> } >> if (tmp >= max) { >> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) >> ci = lock_cluster(si, offset); >> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); >> ci->count = 0; >> + ci->order = 0; >> ci->flags = 0; >> free_cluster(si, ci); >> unlock_cluster(ci); >> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, >> INIT_LIST_HEAD(&p->free_clusters); >> INIT_LIST_HEAD(&p->discard_clusters); >> >> + for (i = 0; i < SWAP_NR_ORDERS; i++) >> + INIT_LIST_HEAD(&p->nonfull_clusters[i]); >> + >> for (i = 0; i < swap_header->info.nr_badpages; i++) { >> unsigned int page_nr = swap_header->info.badpages[i]; >> if (page_nr == 0 || page_nr > swap_header->info.last_page) >> > ^ permalink raw reply [flat|nested] 43+ messages in thread
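[Editorial aside: one way to picture Ryan's last suggestion above, caching the largest contiguous free space per cluster so a too-small cluster can be skipped without scanning it. The sketch below only illustrates the idea; it assumes a swap_map count of 0 means the slot is free, and largest_free_run() and TOY_SWAPFILE_CLUSTER are made-up names, not kernel symbols.]

#define TOY_SWAPFILE_CLUSTER 512	/* stand-in for SWAPFILE_CLUSTER */

/*
 * Largest run of free slots in one cluster's slice of the swap map.
 * "map" points at the first slot of the cluster; a slot is treated as
 * free when its usage count is 0.
 */
static unsigned int largest_free_run(const unsigned char *map)
{
	unsigned int i, run = 0, best = 0;

	for (i = 0; i < TOY_SWAPFILE_CLUSTER; i++) {
		if (map[i] == 0) {
			run++;
			if (run > best)
				best = run;
		} else {
			run = 0;
		}
	}
	return best;	/* caller skips the cluster if best < (1 << order) */
}

The cached value would have to be refreshed whenever slots in the cluster are allocated or freed, which is the cost of avoiding the scan on the allocation path.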
* Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-06-07 10:57 ` Ryan Roberts @ 2024-06-07 20:53 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-06-07 20:53 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On Fri, Jun 7, 2024 at 3:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 07/06/2024 11:35, Ryan Roberts wrote: > > On 24/05/2024 18:17, Chris Li wrote: > >> Track the nonfull cluster as well as the empty cluster > >> on lists. Each order has one nonfull cluster list. > >> > >> The cluster will remember which order it was used during > >> new cluster allocation. > >> > >> When the cluster has free entry, add to the nonfull[order] > >> list. When the free cluster list is empty, also allocate > >> from the nonempty list of that order. > >> > >> This improves the mTHP swap allocation success rate. > > > > If I've understood correctly, the aim here is to link all the current per-cpu > > clusters for a given order together so that if a cpu can't allocate a new > > cluster for a given order, then it can steal another CPU's current cluster for > > that order? > > > > If that's the intent, couldn't that be done just by iterating over the per-cpu, > > per-order cluster pointers? Then you don't need all the linked list churn > > (althogh I like the linked list changes as a nice cleanup, I'm not sure the > > churn is neccessary for this change?). There would likely need to be some > > locking considerations, but it would also allow you to get access to the next > > entry within the cluster for allocation. > > > > However, fundamentally, I don't think this change solves the problem; it just > > takes a bit longer before the allocation fails. The real problem is > > fragmentation due to freeing individual pages from swap entries at different times. > > > > Wouldn't it be better to just extend scanning to support high order allocations? > > Then we can steal a high order block from any cluster, even clusters that were > > previously full, just like we currently do for order-0. Given we are already > > falling back to this path for order-0, I don't think it would be any more > > expensive; infact its less expensive because we only scan once for the high > > order block, rather than scan for every split order-0 page. > > > > Of course that still doesn't solve the proplem entirely; if swap is so > > fragmented that there is no contiguous block of the required order then you > > still have to fall back to splitting. As an extra optimization, you could store > > the largest contiguous free space available in each cluster to avoid scanning in > > case its too small? > > > > > >> > >> There are limitations if the distribution of numbers of > >> different orders of mTHP changes a lot. e.g. there are a lot > >> of nonfull cluster assign to order A while later time there > >> are a lot of order B allocation while very little allocation > >> in order A. Currently the cluster used by order A will not > >> reused by order B unless the cluster is 100% empty. > >> > >> This situation is best addressed by the longer term "swap > >> buddy allocator", in future patches. 
> >> --- > >> include/linux/swap.h | 4 ++++ > >> mm/swapfile.c | 25 +++++++++++++++++++++++-- > >> 2 files changed, 27 insertions(+), 2 deletions(-) > >> > >> diff --git a/include/linux/swap.h b/include/linux/swap.h > >> index 0d3906eff3c9..1b7f0794b9bf 100644 > >> --- a/include/linux/swap.h > >> +++ b/include/linux/swap.h > >> @@ -255,10 +255,12 @@ struct swap_cluster_info { > >> * cluster > >> */ > >> unsigned int count:16; > >> + unsigned int order:8; > >> unsigned int flags:8; > >> struct list_head next; > >> }; > >> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > >> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ > >> > >> > >> /* > >> @@ -297,6 +299,8 @@ struct swap_info_struct { > >> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > >> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > >> struct list_head free_clusters; /* free clusters list */ > >> + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; > >> + /* list of cluster that contains at least one free slot */ > >> unsigned int lowest_bit; /* index of first free in swap_map */ > >> unsigned int highest_bit; /* index of last free in swap_map */ > >> unsigned int pages; /* total of usable pages of swap */ > >> diff --git a/mm/swapfile.c b/mm/swapfile.c > >> index 205a60c5f9cb..51923aba500e 100644 > >> --- a/mm/swapfile.c > >> +++ b/mm/swapfile.c > >> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > >> > >> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > >> { > >> + if (ci->flags & CLUSTER_FLAG_NONFULL) > >> + list_move_tail(&ci->next, &si->free_clusters); > >> + else > >> + list_add_tail(&ci->next, &si->free_clusters); > >> ci->flags = CLUSTER_FLAG_FREE; > >> - list_add_tail(&ci->next, &si->free_clusters); > >> } > >> > >> /* > >> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste > >> ci->count--; > >> > >> if (!ci->count) > >> - free_cluster(p, ci); > >> + return free_cluster(p, ci); > >> + > >> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { > >> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); > >> + ci->flags |= CLUSTER_FLAG_NONFULL; > >> + } > >> } > >> > >> /* > >> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > >> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > >> list_del(&ci->next); > >> spin_lock(&ci->lock); > >> + ci->order = order; > >> + ci->flags = 0; > >> + spin_unlock(&ci->lock); > >> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > >> + } else if (!list_empty(&si->nonfull_clusters[order])) { > >> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); > >> + list_del(&ci->next); > >> + spin_lock(&ci->lock); > >> ci->flags = 0; > >> spin_unlock(&ci->lock); > >> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > > > > This looks wrong to me; if the cluster is on the nonfull list then it will have > > had some entries already allocated (by another cpu). So pointing tmp to the > > first block in the cluster will never yield a free block. The cpu from which you > > are stealing the cluster stores the next free block location in its per-cpu > > structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is > > a better approach than the nonfull list? > > Ahh; of course the cluster scan below will move this along to a free block. You mean the (offset < max) loop, right? Agree. 
Chris > > > > > Additionally, this cluster will be stored back to this cpu's current cluster at > > the bottom of the function. That may or may not be what you intended. > > > >> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > >> break; > >> tmp += nr_pages; > >> } > >> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order); > >> unlock_cluster(ci); > >> } > >> if (tmp >= max) { > >> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > >> ci = lock_cluster(si, offset); > >> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > >> ci->count = 0; > >> + ci->order = 0; > >> ci->flags = 0; > >> free_cluster(si, ci); > >> unlock_cluster(ci); > >> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > >> INIT_LIST_HEAD(&p->free_clusters); > >> INIT_LIST_HEAD(&p->discard_clusters); > >> > >> + for (i = 0; i < SWAP_NR_ORDERS; i++) > >> + INIT_LIST_HEAD(&p->nonfull_clusters[i]); > >> + > >> for (i = 0; i < swap_header->info.nr_badpages; i++) { > >> unsigned int page_nr = swap_header->info.badpages[i]; > >> if (page_nr == 0 || page_nr > swap_header->info.last_page) > >> > > > ^ permalink raw reply [flat|nested] 43+ messages in thread
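[Editorial aside: for readers who want the (offset < max) loop that Ryan and Chris settle on above in isolation, the standalone sketch below mirrors its structure. toy_range_empty() stands in for the kernel's swap_range_empty(); the map pointer and names are illustrative, and the kernel loop also handles locking and per-cpu state omitted here.]

#include <stdbool.h>
#include <stddef.h>

/* stand-in for swap_range_empty(): true if nr slots starting at off are free */
static bool toy_range_empty(const unsigned char *map, size_t off, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (map[off + i] != 0)
			return false;
	return true;
}

/*
 * Walk a cluster in nr_pages-sized, nr_pages-aligned steps, as the
 * (offset < max) loop does, and return the first free block or max
 * when the cluster has no fitting block.
 */
static size_t find_free_block(const unsigned char *map, size_t offset,
			      size_t max, size_t nr_pages)
{
	while (offset < max) {
		if (toy_range_empty(map, offset, nr_pages))
			break;
		offset += nr_pages;
	}
	return offset;
}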
* Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-06-07 10:35 ` Ryan Roberts 2024-06-07 10:57 ` Ryan Roberts @ 2024-06-07 20:52 ` Chris Li 2024-06-10 11:18 ` Ryan Roberts 1 sibling, 1 reply; 43+ messages in thread From: Chris Li @ 2024-06-07 20:52 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On Fri, Jun 7, 2024 at 3:35 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 24/05/2024 18:17, Chris Li wrote: > > Track the nonfull cluster as well as the empty cluster > > on lists. Each order has one nonfull cluster list. > > > > The cluster will remember which order it was used during > > new cluster allocation. > > > > When the cluster has free entry, add to the nonfull[order] > > list. When the free cluster list is empty, also allocate > > from the nonempty list of that order. > > > > This improves the mTHP swap allocation success rate. > > If I've understood correctly, the aim here is to link all the current per-cpu > clusters for a given order together so that if a cpu can't allocate a new > cluster for a given order, then it can steal another CPU's current cluster for > that order? Stealing other CPU's *current* cluster is not the intent. The intent is after all current per-cpu done with this cluster(full), those full clusters are not tracked by any per-cpu struct. When those full clusters become non-full. Track it in the global nonfull cluster list. The per-cpu allocation can take a cluster from that nonfull cluster list and start allocating from it. The V1 code does not specifically check for the stealing behavior, the V2 code will prevent that from happening. Basically each cluster has 4 states and owners: 1) empty, owned by a global free cluster list. 2) per cpu allocating. owned by per CPU current. 3) nonfull (also non empty). own global nonfull list. 4) full, currently not tracked, we can track it under global full list. When the per cpu runs out of free cluster, it can take a cluster from 3) and move it to 2). > > If that's the intent, couldn't that be done just by iterating over the per-cpu, > per-order cluster pointers? Then you don't need all the linked list churn Again, that is not the intent. > (althogh I like the linked list changes as a nice cleanup, I'm not sure the > churn is neccessary for this change?). There would likely need to be some > locking considerations, but it would also allow you to get access to the next > entry within the cluster for allocation. > > However, fundamentally, I don't think this change solves the problem; it just > takes a bit longer before the allocation fails. The real problem is > fragmentation due to freeing individual pages from swap entries at different times. It definitely helps to find nonfull clusters quicker. Please take a look at my above comment and read the patch again. > > Wouldn't it be better to just extend scanning to support high order allocations? > Then we can steal a high order block from any cluster, even clusters that were Steal from higher order causes the higher order harder to allocate, that is downside. In my mind, ideally have some high order cluster reservation scheme so the high order one doesn't mix with the low order one. > previously full, just like we currently do for order-0. Given we are already > falling back to this path for order-0, I don't think it would be any more > expensive; infact its less expensive because we only scan once for the high > order block, rather than scan for every split order-0 page. 
> > Of course that still doesn't solve the proplem entirely; if swap is so > fragmented that there is no contiguous block of the required order then you > still have to fall back to splitting. As an extra optimization, you could store Exactly. That is why I think some high order cluster reservation scheme is needed for a short term solution. The change itself is not too complicated if we can agree on this approach. > the largest contiguous free space available in each cluster to avoid scanning in > case its too small? Avoid scanning does just get to the non available high order result quicker. Does not seem to help increase the high order allocation success rate. > > > > > > There are limitations if the distribution of numbers of > > different orders of mTHP changes a lot. e.g. there are a lot > > of nonfull cluster assign to order A while later time there > > are a lot of order B allocation while very little allocation > > in order A. Currently the cluster used by order A will not > > reused by order B unless the cluster is 100% empty. > > > > This situation is best addressed by the longer term "swap > > buddy allocator", in future patches. > > --- > > include/linux/swap.h | 4 ++++ > > mm/swapfile.c | 25 +++++++++++++++++++++++-- > > 2 files changed, 27 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 0d3906eff3c9..1b7f0794b9bf 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -255,10 +255,12 @@ struct swap_cluster_info { > > * cluster > > */ > > unsigned int count:16; > > + unsigned int order:8; > > unsigned int flags:8; > > struct list_head next; > > }; > > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > > +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ > > > > > > /* > > @@ -297,6 +299,8 @@ struct swap_info_struct { > > unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > > struct swap_cluster_info *cluster_info; /* cluster info. 
Only for SSD */ > > struct list_head free_clusters; /* free clusters list */ > > + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; > > + /* list of cluster that contains at least one free slot */ > > unsigned int lowest_bit; /* index of first free in swap_map */ > > unsigned int highest_bit; /* index of last free in swap_map */ > > unsigned int pages; /* total of usable pages of swap */ > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index 205a60c5f9cb..51923aba500e 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > > > > static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > > { > > + if (ci->flags & CLUSTER_FLAG_NONFULL) > > + list_move_tail(&ci->next, &si->free_clusters); > > + else > > + list_add_tail(&ci->next, &si->free_clusters); > > ci->flags = CLUSTER_FLAG_FREE; > > - list_add_tail(&ci->next, &si->free_clusters); > > } > > > > /* > > @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste > > ci->count--; > > > > if (!ci->count) > > - free_cluster(p, ci); > > + return free_cluster(p, ci); > > + > > + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { > > + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); > > + ci->flags |= CLUSTER_FLAG_NONFULL; > > + } > > } > > > > /* > > @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > > list_del(&ci->next); > > spin_lock(&ci->lock); > > + ci->order = order; > > + ci->flags = 0; > > + spin_unlock(&ci->lock); > > + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > > + } else if (!list_empty(&si->nonfull_clusters[order])) { > > + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); > > + list_del(&ci->next); > > + spin_lock(&ci->lock); > > ci->flags = 0; > > spin_unlock(&ci->lock); > > tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > > This looks wrong to me; if the cluster is on the nonfull list then it will have > had some entries already allocated (by another cpu). So pointing tmp to the > first block in the cluster will never yield a free block. The cpu from which you I believe it will scan until it finds a free block with alignment down in the offset < max loop. while (offset < max) { if (swap_range_empty(si->swap_map, offset, nr_pages)) break; offset += nr_pages; } > are stealing the cluster stores the next free block location in its per-cpu > structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is > a better approach than the nonfull list? No, stealing is not the intent. The intent is quickly finding the non full cluster NOT in other per cpu allocation. > > Additionally, this cluster will be stored back to this cpu's current cluster at > the bottom of the function. That may or may not be what you intended. That is what I intended. It remembers the current allocating cluster, in case this cluster has more than one high order swap entries. 
Chris > > > @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > > break; > > tmp += nr_pages; > > } > > + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order); > > unlock_cluster(ci); > > } > > if (tmp >= max) { > > @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) > > ci = lock_cluster(si, offset); > > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); > > ci->count = 0; > > + ci->order = 0; > > ci->flags = 0; > > free_cluster(si, ci); > > unlock_cluster(ci); > > @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, > > INIT_LIST_HEAD(&p->free_clusters); > > INIT_LIST_HEAD(&p->discard_clusters); > > > > + for (i = 0; i < SWAP_NR_ORDERS; i++) > > + INIT_LIST_HEAD(&p->nonfull_clusters[i]); > > + > > for (i = 0; i < swap_header->info.nr_badpages; i++) { > > unsigned int page_nr = swap_header->info.badpages[i]; > > if (page_nr == 0 || page_nr > swap_header->info.last_page) > > > ^ permalink raw reply [flat|nested] 43+ messages in thread
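[Editorial aside: Chris's four cluster states above (empty on the free list, per-cpu current, nonfull, full and currently untracked) can be summarised as a small state machine. The sketch below is a toy model of only those transitions, with illustrative toy_ names and none of the kernel's locking or list manipulation.]

#include <assert.h>

#define TOY_SWAPFILE_CLUSTER 512

enum toy_cluster_state {
	TOY_EMPTY,	/* 1) on the global free cluster list   */
	TOY_PERCPU,	/* 2) current cluster of some CPU        */
	TOY_NONFULL,	/* 3) partly used, on nonfull[order]     */
	TOY_FULL,	/* 4) every slot allocated, untracked    */
};

struct toy_cluster {
	unsigned int count;	/* allocated slots */
	unsigned int order;
	enum toy_cluster_state state;
};

/* a CPU finishes filling its current cluster */
static void toy_cluster_filled(struct toy_cluster *c)
{
	assert(c->state == TOY_PERCPU && c->count == TOY_SWAPFILE_CLUSTER);
	c->state = TOY_FULL;
}

/* freeing slots from a full cluster makes it visible on nonfull[order] */
static void toy_cluster_free_slots(struct toy_cluster *c, unsigned int nr)
{
	c->count -= nr;
	if (c->count == 0)
		c->state = TOY_EMPTY;		/* back to the free list */
	else if (c->state == TOY_FULL)
		c->state = TOY_NONFULL;		/* available to any CPU */
}

/* a CPU that ran out of free clusters takes one from nonfull[order] */
static void toy_cluster_take_for_alloc(struct toy_cluster *c)
{
	assert(c->state == TOY_NONFULL || c->state == TOY_EMPTY);
	c->state = TOY_PERCPU;
}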
* Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-06-07 20:52 ` Chris Li @ 2024-06-10 11:18 ` Ryan Roberts 2024-06-11 6:09 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Ryan Roberts @ 2024-06-10 11:18 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On 07/06/2024 21:52, Chris Li wrote: > On Fri, Jun 7, 2024 at 3:35 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> On 24/05/2024 18:17, Chris Li wrote: >>> Track the nonfull cluster as well as the empty cluster >>> on lists. Each order has one nonfull cluster list. >>> >>> The cluster will remember which order it was used during >>> new cluster allocation. >>> >>> When the cluster has free entry, add to the nonfull[order] >>> list. When the free cluster list is empty, also allocate >>> from the nonempty list of that order. >>> >>> This improves the mTHP swap allocation success rate. >> >> If I've understood correctly, the aim here is to link all the current per-cpu >> clusters for a given order together so that if a cpu can't allocate a new >> cluster for a given order, then it can steal another CPU's current cluster for >> that order? > > Stealing other CPU's *current* cluster is not the intent. The intent > is after all current per-cpu done with this cluster(full), those full > clusters are not tracked by any per-cpu struct. When those full > clusters become non-full. Track it in the global nonfull cluster list. > The per-cpu allocation can take a cluster from that nonfull cluster > list and start allocating from it. > > The V1 code does not specifically check for the stealing behavior, the > V2 code will prevent that from happening. Basically each cluster has 4 > states and owners: > 1) empty, owned by a global free cluster list. > 2) per cpu allocating. owned by per CPU current. > 3) nonfull (also non empty). own global nonfull list. > 4) full, currently not tracked, we can track it under global full list. > > When the per cpu runs out of free cluster, it can take a cluster from > 3) and move it to 2). OK, sorry for my misunderstanding, and thanks for the explanaiton. I've taken a proper look at the patch and with this explanation now understand the intent. I guess in effect, this is a scanning approach, but you are limiting the clusters that you scan to those that were originally allocated for the required order and which are known to have some free space. I guess this will work more effectively with Barry's series that swaps in a whole large folio in one go, because it is more likely that holes of the required size will appear in the non-full clusters. But previous discussions concluded that it was not always going to be the right approach to swap-in large folios in one go (certainly not for higer orders). So I don't think you can (yet) rely on swap slots being freed as order-sized blocks. That said I still think that your approach should improve the situation even without Barry's series. In fact, why don't you put the cluster on the non-full list at cluster allocation time? Then you can also allow a cpu to steal from another cpu's current cluster (as I initially thought you were doing). I think this should work with pretty much the same logic? And improve chances of allocation without increasing chances of fragmentation? (more on this below). > >> >> If that's the intent, couldn't that be done just by iterating over the per-cpu, >> per-order cluster pointers? 
Then you don't need all the linked list churn > > Again, that is not the intent. > >> (althogh I like the linked list changes as a nice cleanup, I'm not sure the >> churn is neccessary for this change?). There would likely need to be some >> locking considerations, but it would also allow you to get access to the next >> entry within the cluster for allocation. >> >> However, fundamentally, I don't think this change solves the problem; it just >> takes a bit longer before the allocation fails. The real problem is >> fragmentation due to freeing individual pages from swap entries at different times. > > It definitely helps to find nonfull clusters quicker. Please take a > look at my above comment and read the patch again. > >> >> Wouldn't it be better to just extend scanning to support high order allocations? >> Then we can steal a high order block from any cluster, even clusters that were > > Steal from higher order causes the higher order harder to allocate, > that is downside. > In my mind, ideally have some high order cluster reservation scheme so > the high order one doesn't mix with the low order one. Yes, that would make sense; you could limit the number of clusters allocated for each order at any given time. Order-0 stealing will still cause problems. You could probably just remove that and limit order-0 scanning/stealing to clusters that were originally allocated for order-0 too, using the same logic. > >> previously full, just like we currently do for order-0. Given we are already >> falling back to this path for order-0, I don't think it would be any more >> expensive; infact its less expensive because we only scan once for the high >> order block, rather than scan for every split order-0 page. >> >> Of course that still doesn't solve the proplem entirely; if swap is so >> fragmented that there is no contiguous block of the required order then you >> still have to fall back to splitting. As an extra optimization, you could store > > Exactly. That is why I think some high order cluster reservation > scheme is needed for a short term solution. > The change itself is not too complicated if we can agree on this approach. > >> the largest contiguous free space available in each cluster to avoid scanning in >> case its too small? > > Avoid scanning does just get to the non available high order result quicker. > Does not seem to help increase the high order allocation success rate. > >> >> >>> >>> There are limitations if the distribution of numbers of >>> different orders of mTHP changes a lot. e.g. there are a lot >>> of nonfull cluster assign to order A while later time there >>> are a lot of order B allocation while very little allocation >>> in order A. Currently the cluster used by order A will not >>> reused by order B unless the cluster is 100% empty. >>> >>> This situation is best addressed by the longer term "swap >>> buddy allocator", in future patches. 
>>> --- >>> include/linux/swap.h | 4 ++++ >>> mm/swapfile.c | 25 +++++++++++++++++++++++-- >>> 2 files changed, 27 insertions(+), 2 deletions(-) >>> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h >>> index 0d3906eff3c9..1b7f0794b9bf 100644 >>> --- a/include/linux/swap.h >>> +++ b/include/linux/swap.h >>> @@ -255,10 +255,12 @@ struct swap_cluster_info { >>> * cluster >>> */ >>> unsigned int count:16; >>> + unsigned int order:8; >>> unsigned int flags:8; >>> struct list_head next; >>> }; >>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ >>> >>> >>> /* >>> @@ -297,6 +299,8 @@ struct swap_info_struct { >>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ >>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ >>> struct list_head free_clusters; /* free clusters list */ >>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; >>> + /* list of cluster that contains at least one free slot */ >>> unsigned int lowest_bit; /* index of first free in swap_map */ >>> unsigned int highest_bit; /* index of last free in swap_map */ >>> unsigned int pages; /* total of usable pages of swap */ >>> diff --git a/mm/swapfile.c b/mm/swapfile.c >>> index 205a60c5f9cb..51923aba500e 100644 >>> --- a/mm/swapfile.c >>> +++ b/mm/swapfile.c >>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, >>> >>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) >>> { >>> + if (ci->flags & CLUSTER_FLAG_NONFULL) >>> + list_move_tail(&ci->next, &si->free_clusters); >>> + else >>> + list_add_tail(&ci->next, &si->free_clusters); >>> ci->flags = CLUSTER_FLAG_FREE; >>> - list_add_tail(&ci->next, &si->free_clusters); >>> } >>> >>> /* >>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste >>> ci->count--; >>> >>> if (!ci->count) >>> - free_cluster(p, ci); >>> + return free_cluster(p, ci); >>> + >>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { >>> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); >>> + ci->flags |= CLUSTER_FLAG_NONFULL; >>> + } >>> } >>> >>> /* >>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); >>> list_del(&ci->next); >>> spin_lock(&ci->lock); >>> + ci->order = order; >>> + ci->flags = 0; >>> + spin_unlock(&ci->lock); >>> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; >>> + } else if (!list_empty(&si->nonfull_clusters[order])) { You are preferring to scan the nonfull clusters over doing discard and allocating a newly freed cluster; wouldn't it be better to prefer discard over nonfull scanning? >>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); >>> + list_del(&ci->next); I'm struggling a bit with what the value of the nonfull_clusters linked list is. I wonder if it would be simpler to just track the number of free slots in the cluster and iterate over the clusters, scanning the ones with the desired order and which have at least (1 << order) free slots? I guess this list is giving you an ordering such that the cluster you pull off the list first had its first slot freed longest ago so it is most likely to have most space? 
>>> + spin_lock(&ci->lock); >>> ci->flags = 0; >>> spin_unlock(&ci->lock); >>> tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; >> >> This looks wrong to me; if the cluster is on the nonfull list then it will have >> had some entries already allocated (by another cpu). So pointing tmp to the >> first block in the cluster will never yield a free block. The cpu from which you > > I believe it will scan until it finds a free block with alignment down > in the offset < max loop. Yes agreed, my bad. Sorry about that. Thanks, Ryan > > while (offset < max) { > if (swap_range_empty(si->swap_map, offset, nr_pages)) > break; > offset += nr_pages; > } > >> are stealing the cluster stores the next free block location in its per-cpu >> structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is >> a better approach than the nonfull list? > > No, stealing is not the intent. The intent is quickly finding the non > full cluster NOT in other per cpu allocation. > >> >> Additionally, this cluster will be stored back to this cpu's current cluster at >> the bottom of the function. That may or may not be what you intended. > > That is what I intended. It remembers the current allocating cluster, > in case this cluster has more than one high order swap entries. > > Chris > >> >>> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, >>> break; >>> tmp += nr_pages; >>> } >>> + WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order); >>> unlock_cluster(ci); >>> } >>> if (tmp >= max) { >>> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) >>> ci = lock_cluster(si, offset); >>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); >>> ci->count = 0; >>> + ci->order = 0; >>> ci->flags = 0; >>> free_cluster(si, ci); >>> unlock_cluster(ci); >>> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, >>> INIT_LIST_HEAD(&p->free_clusters); >>> INIT_LIST_HEAD(&p->discard_clusters); >>> >>> + for (i = 0; i < SWAP_NR_ORDERS; i++) >>> + INIT_LIST_HEAD(&p->nonfull_clusters[i]); >>> + >>> for (i = 0; i < swap_header->info.nr_badpages; i++) { >>> unsigned int page_nr = swap_header->info.badpages[i]; >>> if (page_nr == 0 || page_nr > swap_header->info.last_page) >>> >> ^ permalink raw reply [flat|nested] 43+ messages in thread
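[Editorial aside: Ryan's alternative above, tracking free slots per cluster and only considering clusters of the wanted order with at least (1 << order) slots free, could be prototyped roughly as below. The flat array walk and the names are assumptions for illustration; as noted in the thread, enough free slots still does not guarantee a contiguous aligned block, so the chosen cluster would still need an internal scan.]

#include <stddef.h>

#define TOY_SWAPFILE_CLUSTER 512

struct toy_cluster_info {
	unsigned int count;	/* allocated slots out of TOY_SWAPFILE_CLUSTER */
	unsigned int order;	/* order this cluster was claimed for          */
};

/*
 * Return the index of the first candidate cluster for "order", or nr if
 * none qualifies: same order and at least (1 << order) slots still free.
 */
static size_t pick_candidate_cluster(const struct toy_cluster_info *ci,
				     size_t nr, unsigned int order)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned int free = TOY_SWAPFILE_CLUSTER - ci[i].count;

		if (ci[i].order == order && free >= (1u << order))
			return i;
	}
	return nr;
}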
* Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list 2024-06-10 11:18 ` Ryan Roberts @ 2024-06-11 6:09 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-06-11 6:09 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On Mon, Jun 10, 2024 at 4:18 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 07/06/2024 21:52, Chris Li wrote: > > On Fri, Jun 7, 2024 at 3:35 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Stealing other CPU's *current* cluster is not the intent. The intent > > is after all current per-cpu done with this cluster(full), those full > > clusters are not tracked by any per-cpu struct. When those full > > clusters become non-full. Track it in the global nonfull cluster list. > > The per-cpu allocation can take a cluster from that nonfull cluster > > list and start allocating from it. > > > > The V1 code does not specifically check for the stealing behavior, the > > V2 code will prevent that from happening. Basically each cluster has 4 > > states and owners: > > 1) empty, owned by a global free cluster list. > > 2) per cpu allocating. owned by per CPU current. > > 3) nonfull (also non empty). own global nonfull list. > > 4) full, currently not tracked, we can track it under global full list. > > > > When the per cpu runs out of free cluster, it can take a cluster from > > 3) and move it to 2). > > OK, sorry for my misunderstanding, and thanks for the explanaiton. I've taken a > proper look at the patch and with this explanation now understand the intent. > > I guess in effect, this is a scanning approach, but you are limiting the > clusters that you scan to those that were originally allocated for the required > order and which are known to have some free space. Yes, not only some free space. When we swap in the same size as the swap out size, we can have the cluster dedicated to that order. Avoid mixing order in that cluster. > > I guess this will work more effectively with Barry's series that swaps in a > whole large folio in one go, because it is more likely that holes of the > required size will appear in the non-full clusters. But previous discussions Ack. > concluded that it was not always going to be the right approach to swap-in large > folios in one go (certainly not for higer orders). So I don't think you can We need to start from somewhere. Right now it is still too early to say which approach will win out. I think it is beneficial to try out swapin the same size as swap out, and see how the data play out. 4K is too small and 2M is too big. Maybe one of the mTHP size would be the sweet spot. > (yet) rely on swap slots being freed as order-sized blocks. That said I still > think that your approach should improve the situation even without Barry's series. Agree. > In fact, why don't you put the cluster on the non-full list at cluster > allocation time? Then you can also allow a cpu to steal from another cpu's That would be the reservation approach. If we know how much swap space we want for that order in advance. We can make the reservation at swap on time, then that order will only allocate from the reserved cluster. > current cluster (as I initially thought you were doing). I think this should For stealing from other cpu's current order cluster, we can have a global list of clusters for per_cpu[order] for the cluster. It's better to track it separately as the other nonfull cluster. 
We only steal from other CPU when we run out of global nonfull clusters. The current patch allows stealing from other CPU as well. Just might happen before it runs out of the nonfull cluster. Might not be too big a deal. > work with pretty much the same logic? And improve chances of allocation without > increasing chances of fragmentation? (more on this below). > > > > >> > >> If that's the intent, couldn't that be done just by iterating over the per-cpu, > >> per-order cluster pointers? Then you don't need all the linked list churn > > > > Again, that is not the intent. > > > >> (althogh I like the linked list changes as a nice cleanup, I'm not sure the > >> churn is neccessary for this change?). There would likely need to be some > >> locking considerations, but it would also allow you to get access to the next > >> entry within the cluster for allocation. > >> > >> However, fundamentally, I don't think this change solves the problem; it just > >> takes a bit longer before the allocation fails. The real problem is > >> fragmentation due to freeing individual pages from swap entries at different times. > > > > It definitely helps to find nonfull clusters quicker. Please take a > > look at my above comment and read the patch again. > > > >> > >> Wouldn't it be better to just extend scanning to support high order allocations? > >> Then we can steal a high order block from any cluster, even clusters that were > > > > Steal from higher order causes the higher order harder to allocate, > > that is downside. > > In my mind, ideally have some high order cluster reservation scheme so > > the high order one doesn't mix with the low order one. > > Yes, that would make sense; you could limit the number of clusters allocated for > each order at any given time. > > Order-0 stealing will still cause problems. You could probably just remove that > and limit order-0 scanning/stealing to clusters that were originally allocated > for order-0 too, using the same logic. Yes, that is the plan. If we reserve some space for higher order, then order-0 can't steal from the reserved higher order cluster. > > > > >> previously full, just like we currently do for order-0. Given we are already > >> falling back to this path for order-0, I don't think it would be any more > >> expensive; infact its less expensive because we only scan once for the high > >> order block, rather than scan for every split order-0 page. > >> > >> Of course that still doesn't solve the proplem entirely; if swap is so > >> fragmented that there is no contiguous block of the required order then you > >> still have to fall back to splitting. As an extra optimization, you could store > > > > Exactly. That is why I think some high order cluster reservation > > scheme is needed for a short term solution. > > The change itself is not too complicated if we can agree on this approach. > > > >> the largest contiguous free space available in each cluster to avoid scanning in > >> case its too small? > > > > Avoid scanning does just get to the non available high order result quicker. > > Does not seem to help increase the high order allocation success rate. > > > >> > >> > >>> > >>> There are limitations if the distribution of numbers of > >>> different orders of mTHP changes a lot. e.g. there are a lot > >>> of nonfull cluster assign to order A while later time there > >>> are a lot of order B allocation while very little allocation > >>> in order A. 
Currently the cluster used by order A will not > >>> reused by order B unless the cluster is 100% empty. > >>> > >>> This situation is best addressed by the longer term "swap > >>> buddy allocator", in future patches. > >>> --- > >>> include/linux/swap.h | 4 ++++ > >>> mm/swapfile.c | 25 +++++++++++++++++++++++-- > >>> 2 files changed, 27 insertions(+), 2 deletions(-) > >>> > >>> diff --git a/include/linux/swap.h b/include/linux/swap.h > >>> index 0d3906eff3c9..1b7f0794b9bf 100644 > >>> --- a/include/linux/swap.h > >>> +++ b/include/linux/swap.h > >>> @@ -255,10 +255,12 @@ struct swap_cluster_info { > >>> * cluster > >>> */ > >>> unsigned int count:16; > >>> + unsigned int order:8; > >>> unsigned int flags:8; > >>> struct list_head next; > >>> }; > >>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ > >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ > >>> > >>> > >>> /* > >>> @@ -297,6 +299,8 @@ struct swap_info_struct { > >>> unsigned char *swap_map; /* vmalloc'ed array of usage counts */ > >>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ > >>> struct list_head free_clusters; /* free clusters list */ > >>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; > >>> + /* list of cluster that contains at least one free slot */ > >>> unsigned int lowest_bit; /* index of first free in swap_map */ > >>> unsigned int highest_bit; /* index of last free in swap_map */ > >>> unsigned int pages; /* total of usable pages of swap */ > >>> diff --git a/mm/swapfile.c b/mm/swapfile.c > >>> index 205a60c5f9cb..51923aba500e 100644 > >>> --- a/mm/swapfile.c > >>> +++ b/mm/swapfile.c > >>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, > >>> > >>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) > >>> { > >>> + if (ci->flags & CLUSTER_FLAG_NONFULL) > >>> + list_move_tail(&ci->next, &si->free_clusters); > >>> + else > >>> + list_add_tail(&ci->next, &si->free_clusters); > >>> ci->flags = CLUSTER_FLAG_FREE; > >>> - list_add_tail(&ci->next, &si->free_clusters); > >>> } > >>> > >>> /* > >>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste > >>> ci->count--; > >>> > >>> if (!ci->count) > >>> - free_cluster(p, ci); > >>> + return free_cluster(p, ci); > >>> + > >>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { > >>> + list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]); > >>> + ci->flags |= CLUSTER_FLAG_NONFULL; > >>> + } > >>> } > >>> > >>> /* > >>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, > >>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next); > >>> list_del(&ci->next); > >>> spin_lock(&ci->lock); > >>> + ci->order = order; > >>> + ci->flags = 0; > >>> + spin_unlock(&ci->lock); > >>> + tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER; > >>> + } else if (!list_empty(&si->nonfull_clusters[order])) { > > You are preferring to scan the nonfull clusters over doing discard and > allocating a newly freed cluster; wouldn't it be better to prefer discard over > nonfull scanning? My consideration is that issuing a discard command takes some IO time. If we have an alternative path to move forward, e.g. using another nonfull cluster, we should do that to avoid latency penalty. 
> > >>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next); > >>> + list_del(&ci->next); > > I'm struggling a bit with what the value of the nonfull_clusters linked list is. Assume we have the swap in size the same as swap out size. The nonfull_clusters will get to the cluster in the desired order quicker. > I wonder if it would be simpler to just track the number of free slots in the > cluster and iterate over the clusters, scanning the ones with the desired order > and which have at least (1 << order) free slots? I guess this list is giving you That would be slower than this approach. We would likely repeatedly scan over a lot of lower order clusters which can't find a big enough free range. We can cache it and speed it up by avoiding repeat scanning the unfitting cluster over and over again, that effectively turns into a buddy allocator approach then. It can work but is more complex. > an ordering such that the cluster you pull off the list first had its first slot > freed longest ago so it is most likely to have most space? Yes, that is one of the considerations as well. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
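[Editorial aside: the reservation idea both sides circle around above, limiting how many clusters each order may claim so that order-0 cannot slowly absorb everything, could look something like the sketch below. The cap values, counters and function names are hypothetical; how the caps would actually be chosen at swapon time is exactly the open question in the thread.]

#include <stdbool.h>

#define TOY_NR_ORDERS 8

/* illustrative per-order caps, e.g. set when the swap device is enabled */
static unsigned int toy_cluster_cap[TOY_NR_ORDERS];
static unsigned int toy_cluster_used[TOY_NR_ORDERS];

/*
 * Gate new cluster allocation per order: an order may only claim a fresh
 * empty cluster while it is under its cap.
 */
static bool may_take_empty_cluster(unsigned int order)
{
	return toy_cluster_used[order] < toy_cluster_cap[order];
}

static void note_cluster_taken(unsigned int order)
{
	toy_cluster_used[order]++;
}

static void note_cluster_released(unsigned int order)
{
	toy_cluster_used[order]--;	/* cluster became empty again */
}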
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-24 17:17 [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li 2024-05-24 17:17 ` [PATCH 1/2] mm: swap: swap cluster switch to double link list Chris Li 2024-05-24 17:17 ` [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li @ 2024-05-28 3:07 ` Barry Song 2024-05-28 21:04 ` Chris Li 2024-06-07 9:43 ` Ryan Roberts 4 siblings, 0 replies; 43+ messages in thread From: Barry Song @ 2024-05-28 3:07 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 4827 bytes --] On Sat, May 25, 2024 at 5:17 AM Chris Li <chrisl@kernel.org> wrote: > > This is the short term solutiolns "swap cluster order" listed > in my "Swap Abstraction" discussion slice 8 in the recent > LSF/MM conference. > > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > orders" is introduced, it only allocates the mTHP swap entries > from new empty cluster list. That works well for PMD size THP, > but it has a serius fragmentation issue reported by Barry. > > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > > The mTHP allocation failure rate raises to almost 100% after a few > hours in Barry's test run. > > The reason is that all the empty cluster has been exhausted while > there are planty of free swap entries to in the cluster that is > not 100% free. > > Address this by remember the swap allocation order in the cluster. > Keep track of the per order non full cluster list for later allocation. > > This greatly improve the sucess rate of the mTHP swap allocation. > While I am still waiting for Barry's test result. I paste Kairui's test Hi Chris, Attached are the test results from a real phone using 4-order mTHP. The results seem better overall, but after 7 hours, especially when the swap device becomes full(soon some apps are killed to free memory and swap), the fallback ratio still reaches 100%. I haven't debugged this, but my guess is that the cluster's order can shift between 4-order and 0-order. Sometimes, they all shift to 0-order, and hardly can they get back to 4-order. 
> result here: > > I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > swapoff -a > mkswap /dev/ram0 > swapon /dev/ram0 > > rmdir /sys/fs/cgroup/benchmark > mkdir -p /sys/fs/cgroup/benchmark > cd /sys/fs/cgroup/benchmark > echo 8G > memory.max > echo $$ > cgroup.procs > > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > -P memcache_binary -n allkeys --key-minimum=1 \ > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > --ratio 1:0 --pipeline 8 -d 1024 > > Before: > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > After: > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > > And the fallback ratio dropped by a lot: > Before: > hugepages-32kB/stats/anon_swpout_fallback:15997 > hugepages-32kB/stats/anon_swpout:18712 > hugepages-512kB/stats/anon_swpout_fallback:192 > hugepages-512kB/stats/anon_swpout:0 > hugepages-2048kB/stats/anon_swpout_fallback:2 > hugepages-2048kB/stats/anon_swpout:0 > hugepages-1024kB/stats/anon_swpout_fallback:0 > hugepages-1024kB/stats/anon_swpout:0 > hugepages-64kB/stats/anon_swpout_fallback:18246 > hugepages-64kB/stats/anon_swpout:17644 > hugepages-16kB/stats/anon_swpout_fallback:13701 > hugepages-16kB/stats/anon_swpout:18234 > hugepages-256kB/stats/anon_swpout_fallback:8642 > hugepages-256kB/stats/anon_swpout:93 > hugepages-128kB/stats/anon_swpout_fallback:21497 > hugepages-128kB/stats/anon_swpout:7596 > > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > > After: > hugepages-32kB/stats/swpout:34445 > hugepages-32kB/stats/swpout_fallback:0 > hugepages-512kB/stats/swpout:1 > hugepages-512kB/stats/swpout_fallback:134 > hugepages-2048kB/stats/swpout:1 > hugepages-2048kB/stats/swpout_fallback:1 > hugepages-1024kB/stats/swpout:6 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:35495 > hugepages-64kB/stats/swpout_fallback:0 > hugepages-16kB/stats/swpout:32441 > hugepages-16kB/stats/swpout_fallback:0 > hugepages-256kB/stats/swpout:2223 > hugepages-256kB/stats/swpout_fallback:6278 > hugepages-128kB/stats/swpout:29136 > hugepages-128kB/stats/swpout_fallback:52 > > Reported-by: Barry Song <21cnbao@gmail.com> > Tested-by: Kairui Song <kasong@tencent.com> > Signed-off-by: Chris Li <chrisl@kernel.org> > --- > Chris Li (2): > mm: swap: swap cluster switch to double link list > mm: swap: mTHP allocate swap entries from nonfull list > > include/linux/swap.h | 18 ++-- > mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > 2 files changed, 93 insertions(+), 177 deletions(-) > --- > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > change-id: 20240523-swap-allocator-1534c480ece4 > > Best regards, > -- > Chris Li <chrisl@kernel.org> > Thanks Barry [-- Attachment #2: chris-swap-patch.png --] [-- Type: image/png, Size: 25106 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
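[Editorial aside: the fallback percentages discussed in this thread follow naturally from the per-size mTHP counters: fallback ratio = swpout_fallback / (swpout + swpout_fallback). A trivial, purely illustrative helper:]

/* fallback ratio as a percentage, from the per-size mTHP sysfs counters */
static double fallback_ratio(unsigned long swpout, unsigned long swpout_fallback)
{
	unsigned long attempts = swpout + swpout_fallback;

	return attempts ? 100.0 * swpout_fallback / attempts : 0.0;
}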
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-24 17:17 [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li ` (2 preceding siblings ...) 2024-05-28 3:07 ` [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Barry Song @ 2024-05-28 21:04 ` Chris Li 2024-05-29 8:55 ` Huang, Ying 2024-05-30 7:49 ` Barry Song 2024-06-07 9:43 ` Ryan Roberts 4 siblings, 2 replies; 43+ messages in thread From: Chris Li @ 2024-05-28 21:04 UTC (permalink / raw) To: Andrew Morton Cc: Kairui Song, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm, Barry Song I am spinning a new version for this series to address two issues found in this series: 1) Oppo discovered a bug in the following line: + ci = si->cluster_info + tmp; Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". That is a serious bug but trivial to fix. 2) order 0 allocation currently blindly scans swap_map disregarding the cluster->order. Given enough order 0 swap allocations(close to the swap file size) the order 0 allocation head will eventually sweep across the whole swapfile and destroy other cluster order allocations. The short term fix is just skipping clusters that are already assigned to higher orders. In the long term, I want to unify the non-SSD to use clusters for locking and allocations as well, just try to follow the last allocation (less seeking) as much as possible. Chris On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote: > > This is the short term solutiolns "swap cluster order" listed > in my "Swap Abstraction" discussion slice 8 in the recent > LSF/MM conference. > > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > orders" is introduced, it only allocates the mTHP swap entries > from new empty cluster list. That works well for PMD size THP, > but it has a serius fragmentation issue reported by Barry. > > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > > The mTHP allocation failure rate raises to almost 100% after a few > hours in Barry's test run. > > The reason is that all the empty cluster has been exhausted while > there are planty of free swap entries to in the cluster that is > not 100% free. > > Address this by remember the swap allocation order in the cluster. > Keep track of the per order non full cluster list for later allocation. > > This greatly improve the sucess rate of the mTHP swap allocation. > While I am still waiting for Barry's test result. 
I paste Kairui's test > result here: > > I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > swapoff -a > mkswap /dev/ram0 > swapon /dev/ram0 > > rmdir /sys/fs/cgroup/benchmark > mkdir -p /sys/fs/cgroup/benchmark > cd /sys/fs/cgroup/benchmark > echo 8G > memory.max > echo $$ > cgroup.procs > > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > -P memcache_binary -n allkeys --key-minimum=1 \ > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > --ratio 1:0 --pipeline 8 -d 1024 > > Before: > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > After: > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > > And the fallback ratio dropped by a lot: > Before: > hugepages-32kB/stats/anon_swpout_fallback:15997 > hugepages-32kB/stats/anon_swpout:18712 > hugepages-512kB/stats/anon_swpout_fallback:192 > hugepages-512kB/stats/anon_swpout:0 > hugepages-2048kB/stats/anon_swpout_fallback:2 > hugepages-2048kB/stats/anon_swpout:0 > hugepages-1024kB/stats/anon_swpout_fallback:0 > hugepages-1024kB/stats/anon_swpout:0 > hugepages-64kB/stats/anon_swpout_fallback:18246 > hugepages-64kB/stats/anon_swpout:17644 > hugepages-16kB/stats/anon_swpout_fallback:13701 > hugepages-16kB/stats/anon_swpout:18234 > hugepages-256kB/stats/anon_swpout_fallback:8642 > hugepages-256kB/stats/anon_swpout:93 > hugepages-128kB/stats/anon_swpout_fallback:21497 > hugepages-128kB/stats/anon_swpout:7596 > > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > > After: > hugepages-32kB/stats/swpout:34445 > hugepages-32kB/stats/swpout_fallback:0 > hugepages-512kB/stats/swpout:1 > hugepages-512kB/stats/swpout_fallback:134 > hugepages-2048kB/stats/swpout:1 > hugepages-2048kB/stats/swpout_fallback:1 > hugepages-1024kB/stats/swpout:6 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:35495 > hugepages-64kB/stats/swpout_fallback:0 > hugepages-16kB/stats/swpout:32441 > hugepages-16kB/stats/swpout_fallback:0 > hugepages-256kB/stats/swpout:2223 > hugepages-256kB/stats/swpout_fallback:6278 > hugepages-128kB/stats/swpout:29136 > hugepages-128kB/stats/swpout_fallback:52 > > Reported-by: Barry Song <21cnbao@gmail.com> > Tested-by: Kairui Song <kasong@tencent.com> > Signed-off-by: Chris Li <chrisl@kernel.org> > --- > Chris Li (2): > mm: swap: swap cluster switch to double link list > mm: swap: mTHP allocate swap entries from nonfull list > > include/linux/swap.h | 18 ++-- > mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > 2 files changed, 93 insertions(+), 177 deletions(-) > --- > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > change-id: 20240523-swap-allocator-1534c480ece4 > > Best regards, > -- > Chris Li <chrisl@kernel.org> > ^ permalink raw reply [flat|nested] 43+ messages in thread
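[Editorial aside: issue 1) above is an indexing mix-up between swap slot offsets and cluster indexes: cluster_info[] has one entry per cluster, so a slot offset must be divided by SWAPFILE_CLUSTER before it can be used as an index. A minimal illustration, with toy_ names standing in for the kernel structures:]

#define TOY_SWAPFILE_CLUSTER 512	/* slots per cluster */

struct toy_cluster_info {
	unsigned int count;
	unsigned int order;
};

/*
 * Map a swap slot offset to its cluster descriptor -- the arithmetic
 * behind the fix described in point 1) above.
 */
static struct toy_cluster_info *
offset_to_cluster(struct toy_cluster_info *cluster_info, unsigned long offset)
{
	return cluster_info + offset / TOY_SWAPFILE_CLUSTER;
}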
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-28 21:04 ` Chris Li @ 2024-05-29 8:55 ` Huang, Ying 2024-05-30 1:13 ` Chris Li 2024-05-30 7:49 ` Barry Song 1 sibling, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-05-29 8:55 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > I am spinning a new version for this series to address two issues > found in this series: > > 1) Oppo discovered a bug in the following line: > + ci = si->cluster_info + tmp; > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > That is a serious bug but trivial to fix. > > 2) order 0 allocation currently blindly scans swap_map disregarding > the cluster->order. IIUC, now, we only scan swap_map[] only if !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). That is, if you doesn't run low swap free space, you will not do that. > Given enough order 0 swap allocations(close to the > swap file size) the order 0 allocation head will eventually sweep > across the whole swapfile and destroy other cluster order allocations. > > The short term fix is just skipping clusters that are already assigned > to higher orders. Better to do any further optimization on top of the simpler one. Need to evaluate whether it's necessary to add more complexity. > In the long term, I want to unify the non-SSD to use clusters for > locking and allocations as well, just try to follow the last > allocation (less seeking) as much as possible. I have thought about that too. Personally, I think that it's good to remove swap_map[] scanning. The implementation can be simplified too. I don't know whether do we need to consider the performance of HDD swap now. -- Best Regards, Huang, Ying > On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote: >> >> This is the short term solutiolns "swap cluster order" listed >> in my "Swap Abstraction" discussion slice 8 in the recent >> LSF/MM conference. >> >> When commit 845982eb264bc "mm: swap: allow storage of all mTHP >> orders" is introduced, it only allocates the mTHP swap entries >> from new empty cluster list. That works well for PMD size THP, >> but it has a serius fragmentation issue reported by Barry. >> >> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ >> >> The mTHP allocation failure rate raises to almost 100% after a few >> hours in Barry's test run. >> >> The reason is that all the empty cluster has been exhausted while >> there are planty of free swap entries to in the cluster that is >> not 100% free. >> >> Address this by remember the swap allocation order in the cluster. >> Keep track of the per order non full cluster list for later allocation. >> >> This greatly improve the sucess rate of the mTHP swap allocation. >> While I am still waiting for Barry's test result. 
I paste Kairui's test >> result here: >> >> I'm able to reproduce such an issue with a simple script (enabling all order of mthp): >> >> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) >> swapoff -a >> mkswap /dev/ram0 >> swapon /dev/ram0 >> >> rmdir /sys/fs/cgroup/benchmark >> mkdir -p /sys/fs/cgroup/benchmark >> cd /sys/fs/cgroup/benchmark >> echo 8G > memory.max >> echo $$ > cgroup.procs >> >> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & >> >> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ >> -P memcache_binary -n allkeys --key-minimum=1 \ >> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ >> --ratio 1:0 --pipeline 8 -d 1024 >> >> Before: >> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 >> After: >> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 >> >> And the fallback ratio dropped by a lot: >> Before: >> hugepages-32kB/stats/anon_swpout_fallback:15997 >> hugepages-32kB/stats/anon_swpout:18712 >> hugepages-512kB/stats/anon_swpout_fallback:192 >> hugepages-512kB/stats/anon_swpout:0 >> hugepages-2048kB/stats/anon_swpout_fallback:2 >> hugepages-2048kB/stats/anon_swpout:0 >> hugepages-1024kB/stats/anon_swpout_fallback:0 >> hugepages-1024kB/stats/anon_swpout:0 >> hugepages-64kB/stats/anon_swpout_fallback:18246 >> hugepages-64kB/stats/anon_swpout:17644 >> hugepages-16kB/stats/anon_swpout_fallback:13701 >> hugepages-16kB/stats/anon_swpout:18234 >> hugepages-256kB/stats/anon_swpout_fallback:8642 >> hugepages-256kB/stats/anon_swpout:93 >> hugepages-128kB/stats/anon_swpout_fallback:21497 >> hugepages-128kB/stats/anon_swpout:7596 >> >> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) >> >> After: >> hugepages-32kB/stats/swpout:34445 >> hugepages-32kB/stats/swpout_fallback:0 >> hugepages-512kB/stats/swpout:1 >> hugepages-512kB/stats/swpout_fallback:134 >> hugepages-2048kB/stats/swpout:1 >> hugepages-2048kB/stats/swpout_fallback:1 >> hugepages-1024kB/stats/swpout:6 >> hugepages-1024kB/stats/swpout_fallback:0 >> hugepages-64kB/stats/swpout:35495 >> hugepages-64kB/stats/swpout_fallback:0 >> hugepages-16kB/stats/swpout:32441 >> hugepages-16kB/stats/swpout_fallback:0 >> hugepages-256kB/stats/swpout:2223 >> hugepages-256kB/stats/swpout_fallback:6278 >> hugepages-128kB/stats/swpout:29136 >> hugepages-128kB/stats/swpout_fallback:52 >> >> Reported-by: Barry Song <21cnbao@gmail.com> >> Tested-by: Kairui Song <kasong@tencent.com> >> Signed-off-by: Chris Li <chrisl@kernel.org> >> --- >> Chris Li (2): >> mm: swap: swap cluster switch to double link list >> mm: swap: mTHP allocate swap entries from nonfull list >> >> include/linux/swap.h | 18 ++-- >> mm/swapfile.c | 252 +++++++++++++++++---------------------------------- >> 2 files changed, 93 insertions(+), 177 deletions(-) >> --- >> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed >> change-id: 20240523-swap-allocator-1534c480ece4 >> >> Best regards, >> -- >> Chris Li <chrisl@kernel.org> >> ^ permalink raw reply [flat|nested] 43+ messages in thread
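A simplified standalone model of the condition discussed above, i.e. when the allocator has to fall back to the legacy swap_map[] scan, as the condition is restated later in the thread. The field names follow the thread's si->free_clusters and si->nonfull_clusters[order]; the helper itself is illustrative only, with an assumed SWAP_NR_ORDERS, and is not the kernel function:

#include <stdbool.h>
#include <stdio.h>

#define SWAP_NR_ORDERS 10       /* assumed number of supported orders */

struct swap_info_model {
        bool free_clusters_empty;               /* list_empty(&si->free_clusters) */
        bool nonfull_empty[SWAP_NR_ORDERS];     /* list_empty(&si->nonfull_clusters[order]) */
};

/*
 * The per-order cluster lists are tried first; only when there is neither
 * an empty cluster nor a nonfull cluster of the requested order left does
 * the allocation resort to scanning swap_map[] entry by entry.
 */
static bool must_scan_swap_map(const struct swap_info_model *si, int order)
{
        return si->free_clusters_empty && si->nonfull_empty[order];
}

int main(void)
{
        struct swap_info_model si = { .free_clusters_empty = true };

        si.nonfull_empty[4] = true;     /* order 4 ran out of nonfull clusters */

        printf("order 4 falls back to scan: %d\n", must_scan_swap_map(&si, 4));
        printf("order 0 falls back to scan: %d\n", must_scan_swap_map(&si, 0));
        return 0;
}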
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-29 8:55 ` Huang, Ying @ 2024-05-30 1:13 ` Chris Li 2024-05-30 2:52 ` Huang, Ying 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-05-30 1:13 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Hi Ying, On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > I am spinning a new version for this series to address two issues > > found in this series: > > > > 1) Oppo discovered a bug in the following line: > > + ci = si->cluster_info + tmp; > > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > > That is a serious bug but trivial to fix. > > > > 2) order 0 allocation currently blindly scans swap_map disregarding > > the cluster->order. > > IIUC, now, we only scan swap_map[] only if > !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). > That is, if you doesn't run low swap free space, you will not do that. You can still swap space in order 0 clusters while order 4 runs out of free_cluster or nonfull_clusters[order]. For Android that is a common case. > > > Given enough order 0 swap allocations(close to the > > swap file size) the order 0 allocation head will eventually sweep > > across the whole swapfile and destroy other cluster order allocations. > > > > The short term fix is just skipping clusters that are already assigned > > to higher orders. > > Better to do any further optimization on top of the simpler one. Need > to evaluate whether it's necessary to add more complexity. I agree this needs more careful planning and discussion. In Android's use case, the swapfile is always almost full. It will run into this situation after long enough swap time. Once the order 0 swap entry starts to pollute the higher order cluster, there is no going back(until the whole cluster is 100% free). > > In the long term, I want to unify the non-SSD to use clusters for > > locking and allocations as well, just try to follow the last > > allocation (less seeking) as much as possible. > > I have thought about that too. Personally, I think that it's good to > remove swap_map[] scanning. The implementation can be simplified too. Agree. I look at the commit that introduces the SSD cluster. The commit message indicates a lot of CPU time spent in swap_map scanning, especially when the swapfile is almost full. The main motivation to introduce the cluster in HDD is to simplify and unify the code. > I don't know whether do we need to consider the performance of HDD swap > now. I am not sure about that either. We can make the best effort to reduce the seek. Chris > -- > Best Regards, > Huang, Ying > > > On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote: > >> > >> This is the short term solutiolns "swap cluster order" listed > >> in my "Swap Abstraction" discussion slice 8 in the recent > >> LSF/MM conference. > >> > >> When commit 845982eb264bc "mm: swap: allow storage of all mTHP > >> orders" is introduced, it only allocates the mTHP swap entries > >> from new empty cluster list. That works well for PMD size THP, > >> but it has a serius fragmentation issue reported by Barry. > >> > >> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > >> > >> The mTHP allocation failure rate raises to almost 100% after a few > >> hours in Barry's test run. 
> >> > >> The reason is that all the empty cluster has been exhausted while > >> there are planty of free swap entries to in the cluster that is > >> not 100% free. > >> > >> Address this by remember the swap allocation order in the cluster. > >> Keep track of the per order non full cluster list for later allocation. > >> > >> This greatly improve the sucess rate of the mTHP swap allocation. > >> While I am still waiting for Barry's test result. I paste Kairui's test > >> result here: > >> > >> I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > >> > >> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > >> swapoff -a > >> mkswap /dev/ram0 > >> swapon /dev/ram0 > >> > >> rmdir /sys/fs/cgroup/benchmark > >> mkdir -p /sys/fs/cgroup/benchmark > >> cd /sys/fs/cgroup/benchmark > >> echo 8G > memory.max > >> echo $$ > cgroup.procs > >> > >> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > >> > >> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > >> -P memcache_binary -n allkeys --key-minimum=1 \ > >> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > >> --ratio 1:0 --pipeline 8 -d 1024 > >> > >> Before: > >> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > >> After: > >> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > >> > >> And the fallback ratio dropped by a lot: > >> Before: > >> hugepages-32kB/stats/anon_swpout_fallback:15997 > >> hugepages-32kB/stats/anon_swpout:18712 > >> hugepages-512kB/stats/anon_swpout_fallback:192 > >> hugepages-512kB/stats/anon_swpout:0 > >> hugepages-2048kB/stats/anon_swpout_fallback:2 > >> hugepages-2048kB/stats/anon_swpout:0 > >> hugepages-1024kB/stats/anon_swpout_fallback:0 > >> hugepages-1024kB/stats/anon_swpout:0 > >> hugepages-64kB/stats/anon_swpout_fallback:18246 > >> hugepages-64kB/stats/anon_swpout:17644 > >> hugepages-16kB/stats/anon_swpout_fallback:13701 > >> hugepages-16kB/stats/anon_swpout:18234 > >> hugepages-256kB/stats/anon_swpout_fallback:8642 > >> hugepages-256kB/stats/anon_swpout:93 > >> hugepages-128kB/stats/anon_swpout_fallback:21497 > >> hugepages-128kB/stats/anon_swpout:7596 > >> > >> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > >> > >> After: > >> hugepages-32kB/stats/swpout:34445 > >> hugepages-32kB/stats/swpout_fallback:0 > >> hugepages-512kB/stats/swpout:1 > >> hugepages-512kB/stats/swpout_fallback:134 > >> hugepages-2048kB/stats/swpout:1 > >> hugepages-2048kB/stats/swpout_fallback:1 > >> hugepages-1024kB/stats/swpout:6 > >> hugepages-1024kB/stats/swpout_fallback:0 > >> hugepages-64kB/stats/swpout:35495 > >> hugepages-64kB/stats/swpout_fallback:0 > >> hugepages-16kB/stats/swpout:32441 > >> hugepages-16kB/stats/swpout_fallback:0 > >> hugepages-256kB/stats/swpout:2223 > >> hugepages-256kB/stats/swpout_fallback:6278 > >> hugepages-128kB/stats/swpout:29136 > >> hugepages-128kB/stats/swpout_fallback:52 > >> > >> Reported-by: Barry Song <21cnbao@gmail.com> > >> Tested-by: Kairui Song <kasong@tencent.com> > >> Signed-off-by: Chris Li <chrisl@kernel.org> > >> --- > >> Chris Li (2): > >> mm: swap: swap cluster switch to double link list > >> mm: swap: mTHP allocate swap entries from nonfull list > >> > >> include/linux/swap.h | 18 ++-- > >> mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > >> 2 files changed, 93 insertions(+), 177 deletions(-) > >> --- > >> base-commit: 
c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > >> change-id: 20240523-swap-allocator-1534c480ece4 > >> > >> Best regards, > >> -- > >> Chris Li <chrisl@kernel.org> > >> > ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-30 1:13 ` Chris Li @ 2024-05-30 2:52 ` Huang, Ying 2024-05-30 8:08 ` Kairui Song 2024-05-30 21:44 ` Chris Li 0 siblings, 2 replies; 43+ messages in thread From: Huang, Ying @ 2024-05-30 2:52 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > Hi Ying, > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > I am spinning a new version for this series to address two issues >> > found in this series: >> > >> > 1) Oppo discovered a bug in the following line: >> > + ci = si->cluster_info + tmp; >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". >> > That is a serious bug but trivial to fix. >> > >> > 2) order 0 allocation currently blindly scans swap_map disregarding >> > the cluster->order. >> >> IIUC, now, we only scan swap_map[] only if >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). >> That is, if you doesn't run low swap free space, you will not do that. > > You can still swap space in order 0 clusters while order 4 runs out of > free_cluster > or nonfull_clusters[order]. For Android that is a common case. When we fail to allocate order 4, we will fallback to order 0. Still don't need to scan swap_map[]. But after looking at your below reply, I realized that the swap space is almost full at most times in your cases. Then, it's possible that we run into scanning swap_map[]. list_empty(&si->free_clusters) && list_empty(&si->nonfull_clusters[order]) will become true, if we put too many clusters in si->percpu_cluster. So, if we want to avoid to scan swap_map[], we can stop add clusters in si->percpu_cluster when swap space runs low. And maybe take clusters out of si->percpu_cluster sometimes. Another issue is nonfull_cluster[order1] cannot be used for nonfull_cluster[order2]. In definition, we should not fail order 0 allocation, we need to steal nonfull_cluster[order>0] for order 0 allocation. This can avoid to scan swap_map[] too. This may be not perfect, but it is the simplest first step implementation. You can optimize based on it further. And, I checked your code again. It appears that si->percpu_cluster may be put in si->nonfull_cluster[], then be used by another CPU. Please check it. -- Best Regards, Huang, Ying [snip] ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-30 2:52 ` Huang, Ying @ 2024-05-30 8:08 ` Kairui Song 2024-05-30 18:31 ` Chris Li 2024-05-30 21:44 ` Chris Li 1 sibling, 1 reply; 43+ messages in thread From: Kairui Song @ 2024-05-30 8:08 UTC (permalink / raw) To: Huang, Ying Cc: Chris Li, Andrew Morton, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > Hi Ying, > > > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > I am spinning a new version for this series to address two issues > >> > found in this series: > >> > > >> > 1) Oppo discovered a bug in the following line: > >> > + ci = si->cluster_info + tmp; > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > >> > That is a serious bug but trivial to fix. > >> > > >> > 2) order 0 allocation currently blindly scans swap_map disregarding > >> > the cluster->order. > >> > >> IIUC, now, we only scan swap_map[] only if > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). > >> That is, if you doesn't run low swap free space, you will not do that. > > > > You can still swap space in order 0 clusters while order 4 runs out of > > free_cluster > > or nonfull_clusters[order]. For Android that is a common case. > > When we fail to allocate order 4, we will fallback to order 0. Still > don't need to scan swap_map[]. But after looking at your below reply, I > realized that the swap space is almost full at most times in your cases. > Then, it's possible that we run into scanning swap_map[]. > list_empty(&si->free_clusters) && > list_empty(&si->nonfull_clusters[order]) will become true, if we put too > many clusters in si->percpu_cluster. So, if we want to avoid to scan > swap_map[], we can stop add clusters in si->percpu_cluster when swap > space runs low. And maybe take clusters out of si->percpu_cluster > sometimes. Stop adding when it runs low seems too late, there could still be a free cluster stuck on a CPU, and not getting scanned, right? > Another issue is nonfull_cluster[order1] cannot be used for > nonfull_cluster[order2]. In definition, we should not fail order 0 > allocation, we need to steal nonfull_cluster[order>0] for order 0 > allocation. This can avoid to scan swap_map[] too. This may be not > perfect, but it is the simplest first step implementation. You can > optimize based on it further. This can be extended to allow any order < MAX_ORDER to steal from higher order, which might increase fragmentation though. So this is looking more and more like a buddy allocator, and that should be the long term solution. ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-30 8:08 ` Kairui Song @ 2024-05-30 18:31 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-05-30 18:31 UTC (permalink / raw) To: Kairui Song Cc: Huang, Ying, Andrew Morton, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Thu, May 30, 2024 at 1:08 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Thu, May 30, 2024 at 10:54 AM Huang, Ying <ying.huang@intel.com> wrote: > > > > Chris Li <chrisl@kernel.org> writes: > > > > > Hi Ying, > > > > > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: > > >> > > >> Chris Li <chrisl@kernel.org> writes: > > >> > > >> > I am spinning a new version for this series to address two issues > > >> > found in this series: > > >> > > > >> > 1) Oppo discovered a bug in the following line: > > >> > + ci = si->cluster_info + tmp; > > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > > >> > That is a serious bug but trivial to fix. > > >> > > > >> > 2) order 0 allocation currently blindly scans swap_map disregarding > > >> > the cluster->order. > > >> > > >> IIUC, now, we only scan swap_map[] only if > > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). > > >> That is, if you doesn't run low swap free space, you will not do that. > > > > > > You can still swap space in order 0 clusters while order 4 runs out of > > > free_cluster > > > or nonfull_clusters[order]. For Android that is a common case. > > > > When we fail to allocate order 4, we will fallback to order 0. Still > > don't need to scan swap_map[]. But after looking at your below reply, I > > realized that the swap space is almost full at most times in your cases. > > Then, it's possible that we run into scanning swap_map[]. > > list_empty(&si->free_clusters) && > > list_empty(&si->nonfull_clusters[order]) will become true, if we put too > > many clusters in si->percpu_cluster. So, if we want to avoid to scan > > swap_map[], we can stop add clusters in si->percpu_cluster when swap > > space runs low. And maybe take clusters out of si->percpu_cluster > > sometimes. > > Stop adding when it runs low seems too late, there could still be a > free cluster stuck on a CPU, and not getting scanned, right? The free clusters stuck on the CPU are a small number. Only a handful of clusters. Preventing low order swap polluting the high order cluster is more urgent. > > > Another issue is nonfull_cluster[order1] cannot be used for > > nonfull_cluster[order2]. In definition, we should not fail order 0 > > allocation, we need to steal nonfull_cluster[order>0] for order 0 > > allocation. This can avoid to scan swap_map[] too. This may be not > > perfect, but it is the simplest first step implementation. You can > > optimize based on it further. > > This can be extended to allow any order < MAX_ORDER to steal from > higher order, which might increase fragmentation though. Steal from higher order is a bad thing. Because the value of the allocator is able to allocate from higher order. High to low is always trivil, the low to high is impossible. See the other email having a "knob" to reserve some swap space for high order allocations. That is not perfect but more useful. > > So this is looking more and more like a buddy allocator, and that > should be the long term solution. > In Barry's test case, there is a huge swing of order 0 and order 4 allocation caused by the low memory killer. 
Apps get killed, and it takes a while for an app to launch again and swap out high order entries. A buddy allocator would be of limited help there, because once a cluster is used for order 0, the fragmentation will prevent higher order allocation from it. We do need a way to swap out large folios using discontiguous swap entries. That is the longer term solution to Barry's usage situation. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-30 2:52 ` Huang, Ying 2024-05-30 8:08 ` Kairui Song @ 2024-05-30 21:44 ` Chris Li 2024-05-31 2:35 ` Huang, Ying 1 sibling, 1 reply; 43+ messages in thread From: Chris Li @ 2024-05-30 21:44 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > Hi Ying, > > > > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > I am spinning a new version for this series to address two issues > >> > found in this series: > >> > > >> > 1) Oppo discovered a bug in the following line: > >> > + ci = si->cluster_info + tmp; > >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > >> > That is a serious bug but trivial to fix. > >> > > >> > 2) order 0 allocation currently blindly scans swap_map disregarding > >> > the cluster->order. > >> > >> IIUC, now, we only scan swap_map[] only if > >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). > >> That is, if you doesn't run low swap free space, you will not do that. > > > > You can still swap space in order 0 clusters while order 4 runs out of > > free_cluster > > or nonfull_clusters[order]. For Android that is a common case. > > When we fail to allocate order 4, we will fallback to order 0. Still > don't need to scan swap_map[]. But after looking at your below reply, I > realized that the swap space is almost full at most times in your cases. > Then, it's possible that we run into scanning swap_map[]. > list_empty(&si->free_clusters) && > list_empty(&si->nonfull_clusters[order]) will become true, if we put too > many clusters in si->percpu_cluster. So, if we want to avoid to scan > swap_map[], we can stop add clusters in si->percpu_cluster when swap > space runs low. And maybe take clusters out of si->percpu_cluster > sometimes. One idea after reading your reply. If we run out of the nonfull_cluster[order], we should be able to use other cpu's si->percpu_cluster[] as well. That is a very small win for Android, because android does not have too many cpu. We are talking about a handful of clusters, which might not justify the code complexity. It does not change the behavior that order 0 can pollut higher order. > > Another issue is nonfull_cluster[order1] cannot be used for > nonfull_cluster[order2]. In definition, we should not fail order 0 > allocation, we need to steal nonfull_cluster[order>0] for order 0 > allocation. This can avoid to scan swap_map[] too. This may be not > perfect, but it is the simplest first step implementation. You can > optimize based on it further. Yes, that is listed as the limitation of this cluster order approach. Initially we need to support one order well first. We might choose what order that is, 16K or 64K folio. 4K pages are too small, 2M pages are too big. The sweet spot might be some there in between. If we can support one order well, we can demonstrate the value of the mTHP. We can worry about other mix orders later. Do you have any suggestions for how to prevent the order 0 polluting the higher order cluster? If we allow that to happen, then it defeats the goal of being able to allocate higher order swap entries. The tricky question is we don't know how much swap space we should reserve for each order. 
We can always break higher order clusters into lower orders, but can't do the reverse. The current patch series lets the actual usage determine the percentage of the clusters for each order. However, that seems not enough for the test case Barry has. When an app gets OOM killed, a large swing of order 0 swap shows up with not enough higher order usage for that brief moment. The order 0 swap entries will then pollute the high order clusters. We are currently debating a "knob" to be able to reserve a certain % of swap space for a certain order. Those reservations will be guaranteed, and order 0 swap entries can't pollute them even when the system runs out of swap space. That can make the mTHP at least usable for the Android case. Do you see another way to protect the high order clusters from being polluted by lower order allocations? > > And, I checked your code again. It appears that si->percpu_cluster may > be put in si->nonfull_cluster[], then be used by another CPU. Please > check it. Ah, good point. I think it does. Let me take a closer look. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
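The "knob" mentioned above is still only being debated, so the following is purely a hypothetical sketch of what a per-order reservation check could look like; the swap_reserve_model fields and the function name are invented for illustration and do not exist in the kernel:

#include <stdbool.h>
#include <stdio.h>

#define SWAP_NR_ORDERS 10       /* assumed number of supported orders */

/* Hypothetical per-device accounting of reserved vs. free clusters. */
struct swap_reserve_model {
        unsigned long free_clusters;                     /* clusters not assigned to any order */
        unsigned long reserved_clusters[SWAP_NR_ORDERS]; /* knob: clusters promised to each order */
};

/*
 * Order 0 may only take a free cluster when doing so still leaves enough
 * clusters to honor the reservations of the higher orders.  A higher-order
 * request may dip into its own reservation.
 */
static bool may_take_free_cluster(const struct swap_reserve_model *m, int order)
{
        unsigned long still_reserved = 0;
        int i;

        for (i = order + 1; i < SWAP_NR_ORDERS; i++)
                still_reserved += m->reserved_clusters[i];

        return m->free_clusters > still_reserved;
}

int main(void)
{
        struct swap_reserve_model m = { .free_clusters = 8 };

        m.reserved_clusters[4] = 8;     /* everything left is promised to order 4 */

        printf("order 0 may take a free cluster: %d\n", may_take_free_cluster(&m, 0));
        printf("order 4 may take a free cluster: %d\n", may_take_free_cluster(&m, 4));
        return 0;
}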
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-30 21:44 ` Chris Li @ 2024-05-31 2:35 ` Huang, Ying 2024-05-31 12:40 ` Kairui Song 2024-06-05 7:08 ` Chris Li 0 siblings, 2 replies; 43+ messages in thread From: Huang, Ying @ 2024-05-31 2:35 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > Hi Ying, >> > >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > I am spinning a new version for this series to address two issues >> >> > found in this series: >> >> > >> >> > 1) Oppo discovered a bug in the following line: >> >> > + ci = si->cluster_info + tmp; >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". >> >> > That is a serious bug but trivial to fix. >> >> > >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding >> >> > the cluster->order. >> >> >> >> IIUC, now, we only scan swap_map[] only if >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). >> >> That is, if you doesn't run low swap free space, you will not do that. >> > >> > You can still swap space in order 0 clusters while order 4 runs out of >> > free_cluster >> > or nonfull_clusters[order]. For Android that is a common case. >> >> When we fail to allocate order 4, we will fallback to order 0. Still >> don't need to scan swap_map[]. But after looking at your below reply, I >> realized that the swap space is almost full at most times in your cases. >> Then, it's possible that we run into scanning swap_map[]. >> list_empty(&si->free_clusters) && >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too >> many clusters in si->percpu_cluster. So, if we want to avoid to scan >> swap_map[], we can stop add clusters in si->percpu_cluster when swap >> space runs low. And maybe take clusters out of si->percpu_cluster >> sometimes. > > One idea after reading your reply. If we run out of the > nonfull_cluster[order], we should be able to use other cpu's > si->percpu_cluster[] as well. That is a very small win for Android, This will be useful in general. The number CPU may be large, and multiple orders may be used. > because android does not have too many cpu. We are talking about a > handful of clusters, which might not justify the code complexity. It > does not change the behavior that order 0 can pollut higher order. I have a feeling that you don't really know why swap_map[] is scanned. I suggest you to do more test and tracing to find out the reason. I suspect that there are some non-full cluster collection issues. >> Another issue is nonfull_cluster[order1] cannot be used for >> nonfull_cluster[order2]. In definition, we should not fail order 0 >> allocation, we need to steal nonfull_cluster[order>0] for order 0 >> allocation. This can avoid to scan swap_map[] too. This may be not >> perfect, but it is the simplest first step implementation. You can >> optimize based on it further. > > Yes, that is listed as the limitation of this cluster order approach. > Initially we need to support one order well first. We might choose > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages > are too big. The sweet spot might be some there in between. 
If we can > support one order well, we can demonstrate the value of the mTHP. We > can worry about other mix orders later. > > Do you have any suggestions for how to prevent the order 0 polluting > the higher order cluster? If we allow that to happen, then it defeats > the goal of being able to allocate higher order swap entries. The > tricky question is we don't know how much swap space we should reserve > for each order. We can always break higher order clusters to lower > order, but can't do the reserves. The current patch series lets the > actual usage determine the percentage of the cluster for each order. > However that seems not enough for the test case Barry has. When the > app gets OOM kill that is where a large swing of order 0 swap will > show up and not enough higher order usage for the brief moment. The > order 0 swap entry will pollute the high order cluster. We are > currently debating a "knob" to be able to reserve a certain % of swap > space for a certain order. Those reservations will be guaranteed and > order 0 swap entry can't pollute them even when it runs out of swap > space. That can make the mTHP at least usable for the Android case. IMO, the bottom line is that order-0 allocation is the first class citizen, we must keep it optimized. And, OOM with free swap space isn't acceptable. Please consider the policy we used for page allocation. > Do you see another way to protect the high order cluster polluted by > lower order one? If we use high-order page allocation as reference, we need something like compaction to guarantee high-order allocation finally. But we are too far from that. For specific configuration, I believe that we can get reasonable high-order swap entry allocation success rate for specific use cases. For example, if we only do limited maximum number order-0 swap entries allocation, can we keep high-order clusters? >> >> And, I checked your code again. It appears that si->percpu_cluster may >> be put in si->nonfull_cluster[], then be used by another CPU. Please >> check it. > > Ah, good point. I think it does. Let me take a closer look. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-31 2:35 ` Huang, Ying @ 2024-05-31 12:40 ` Kairui Song 2024-06-04 7:27 ` Huang, Ying 2024-06-05 7:30 ` Chris Li 2024-06-05 7:08 ` Chris Li 1 sibling, 2 replies; 43+ messages in thread From: Kairui Song @ 2024-05-31 12:40 UTC (permalink / raw) To: Huang, Ying Cc: Chris Li, Andrew Morton, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > Hi Ying, > >> > > >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Chris Li <chrisl@kernel.org> writes: > >> >> > >> >> > I am spinning a new version for this series to address two issues > >> >> > found in this series: > >> >> > > >> >> > 1) Oppo discovered a bug in the following line: > >> >> > + ci = si->cluster_info + tmp; > >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > >> >> > That is a serious bug but trivial to fix. > >> >> > > >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding > >> >> > the cluster->order. > >> >> > >> >> IIUC, now, we only scan swap_map[] only if > >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). > >> >> That is, if you doesn't run low swap free space, you will not do that. > >> > > >> > You can still swap space in order 0 clusters while order 4 runs out of > >> > free_cluster > >> > or nonfull_clusters[order]. For Android that is a common case. > >> > >> When we fail to allocate order 4, we will fallback to order 0. Still > >> don't need to scan swap_map[]. But after looking at your below reply, I > >> realized that the swap space is almost full at most times in your cases. > >> Then, it's possible that we run into scanning swap_map[]. > >> list_empty(&si->free_clusters) && > >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too > >> many clusters in si->percpu_cluster. So, if we want to avoid to scan > >> swap_map[], we can stop add clusters in si->percpu_cluster when swap > >> space runs low. And maybe take clusters out of si->percpu_cluster > >> sometimes. > > > > One idea after reading your reply. If we run out of the > > nonfull_cluster[order], we should be able to use other cpu's > > si->percpu_cluster[] as well. That is a very small win for Android, > > This will be useful in general. The number CPU may be large, and > multiple orders may be used. > > > because android does not have too many cpu. We are talking about a > > handful of clusters, which might not justify the code complexity. It > > does not change the behavior that order 0 can pollut higher order. > > I have a feeling that you don't really know why swap_map[] is scanned. > I suggest you to do more test and tracing to find out the reason. I > suspect that there are some non-full cluster collection issues. > > >> Another issue is nonfull_cluster[order1] cannot be used for > >> nonfull_cluster[order2]. In definition, we should not fail order 0 > >> allocation, we need to steal nonfull_cluster[order>0] for order 0 > >> allocation. This can avoid to scan swap_map[] too. This may be not > >> perfect, but it is the simplest first step implementation. You can > >> optimize based on it further. > > > > Yes, that is listed as the limitation of this cluster order approach. 
> > Initially we need to support one order well first. We might choose > > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages > > are too big. The sweet spot might be some there in between. If we can > > support one order well, we can demonstrate the value of the mTHP. We > > can worry about other mix orders later. > > > > Do you have any suggestions for how to prevent the order 0 polluting > > the higher order cluster? If we allow that to happen, then it defeats > > the goal of being able to allocate higher order swap entries. The > > tricky question is we don't know how much swap space we should reserve > > for each order. We can always break higher order clusters to lower > > order, but can't do the reserves. The current patch series lets the > > actual usage determine the percentage of the cluster for each order. > > However that seems not enough for the test case Barry has. When the > > app gets OOM kill that is where a large swing of order 0 swap will > > show up and not enough higher order usage for the brief moment. The > > order 0 swap entry will pollute the high order cluster. We are > > currently debating a "knob" to be able to reserve a certain % of swap > > space for a certain order. Those reservations will be guaranteed and > > order 0 swap entry can't pollute them even when it runs out of swap > > space. That can make the mTHP at least usable for the Android case. > > IMO, the bottom line is that order-0 allocation is the first class > citizen, we must keep it optimized. And, OOM with free swap space isn't > acceptable. Please consider the policy we used for page allocation. > > > Do you see another way to protect the high order cluster polluted by > > lower order one? > > If we use high-order page allocation as reference, we need something > like compaction to guarantee high-order allocation finally. But we are > too far from that. > > For specific configuration, I believe that we can get reasonable > high-order swap entry allocation success rate for specific use cases. > For example, if we only do limited maximum number order-0 swap entries > allocation, can we keep high-order clusters? Isn't limiting order-0 allocation breaks the bottom line that order-0 allocation is the first class citizen, and should not fail if there is space? Just my two cents... I had a try locally based on Chris's work, allowing order 0 to use nonfull_clusters as Ying has suggested, and starting with low order and increase the order until nonfull_cluster[order] is not empty, that way higher order is just better protected, because unless we ran out of free_cluster and nonfull_cluster, direct scan won't happen. More concretely, I applied the following changes, which didn't change the code much: - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then free_clusters, then discard_cluster. - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i) nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster returns false. A quick test still using the memtier test, but decreased the swap device size from 10G to 8g for higher pressure. 
Before: hugepages-32kB/stats/swpout:34013 hugepages-32kB/stats/swpout_fallback:266 hugepages-512kB/stats/swpout:0 hugepages-512kB/stats/swpout_fallback:77 hugepages-2048kB/stats/swpout:0 hugepages-2048kB/stats/swpout_fallback:1 hugepages-1024kB/stats/swpout:0 hugepages-1024kB/stats/swpout_fallback:0 hugepages-64kB/stats/swpout:35088 hugepages-64kB/stats/swpout_fallback:66 hugepages-16kB/stats/swpout:31848 hugepages-16kB/stats/swpout_fallback:402 hugepages-256kB/stats/swpout:390 hugepages-256kB/stats/swpout_fallback:7244 hugepages-128kB/stats/swpout:28573 hugepages-128kB/stats/swpout_fallback:474 After: hugepages-32kB/stats/swpout:31448 hugepages-32kB/stats/swpout_fallback:3354 hugepages-512kB/stats/swpout:30 hugepages-512kB/stats/swpout_fallback:33 hugepages-2048kB/stats/swpout:2 hugepages-2048kB/stats/swpout_fallback:0 hugepages-1024kB/stats/swpout:0 hugepages-1024kB/stats/swpout_fallback:0 hugepages-64kB/stats/swpout:31255 hugepages-64kB/stats/swpout_fallback:3112 hugepages-16kB/stats/swpout:29931 hugepages-16kB/stats/swpout_fallback:3397 hugepages-256kB/stats/swpout:5223 hugepages-256kB/stats/swpout_fallback:2351 hugepages-128kB/stats/swpout:25600 hugepages-128kB/stats/swpout_fallback:2194 High order (256k) swapout rate are significantly higher, 512k is now possible, which indicate high orders are better protected, lower orders are sacrificed but seems worth it. ^ permalink raw reply [flat|nested] 43+ messages in thread
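A minimal sketch of the allocation preference described above (check the nonfull list of the requested order first, then the free list, then the discard list, and for order 0 also walk every order's nonfull list before giving up). This is a simplified model rather than the real scan_swap_map_try_ssd_cluster(); list handling is reduced to counters and the helper names are stand-ins:

#include <stdbool.h>
#include <stdio.h>

#define SWAP_NR_ORDERS 10       /* assumed number of supported orders */

struct cluster_list { int nr; };        /* stand-in for a list_head of clusters */

struct swap_info_model {
        struct cluster_list nonfull_clusters[SWAP_NR_ORDERS];
        struct cluster_list free_clusters;
        struct cluster_list discard_clusters;
};

static bool take_cluster(struct cluster_list *list)
{
        if (!list->nr)
                return false;
        list->nr--;             /* stand-in for detaching one cluster */
        return true;
}

/*
 * Modified preference: nonfull clusters of the requested order first, then
 * empty clusters, then clusters waiting for discard.  For order 0, walk the
 * nonfull lists of every order before returning false, so the swap_map[]
 * scan is only reached when really nothing is left.
 */
static bool try_ssd_cluster(struct swap_info_model *si, int order)
{
        int i;

        if (take_cluster(&si->nonfull_clusters[order]))
                return true;
        if (take_cluster(&si->free_clusters))
                return true;
        if (take_cluster(&si->discard_clusters))
                return true;
        if (order == 0) {
                for (i = 0; i < SWAP_NR_ORDERS; i++)
                        if (take_cluster(&si->nonfull_clusters[i]))
                                return true;
        }
        return false;           /* caller falls back to scanning swap_map[] */
}

int main(void)
{
        struct swap_info_model si = { 0 };

        si.nonfull_clusters[4].nr = 2;  /* only order-4 nonfull clusters are left */

        printf("order 4: %d\n", try_ssd_cluster(&si, 4));   /* 1, from its own list */
        printf("order 0: %d\n", try_ssd_cluster(&si, 0));   /* 1, stolen from order 4 */
        return 0;
}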
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-31 12:40 ` Kairui Song @ 2024-06-04 7:27 ` Huang, Ying 2024-06-05 7:40 ` Chris Li 2024-06-05 7:30 ` Chris Li 1 sibling, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-06-04 7:27 UTC (permalink / raw) To: Kairui Song Cc: Chris Li, Andrew Morton, Ryan Roberts, linux-kernel, linux-mm, Barry Song Kairui Song <ryncsn@gmail.com> writes: > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > Hi Ying, >> >> > >> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> >> >> > I am spinning a new version for this series to address two issues >> >> >> > found in this series: >> >> >> > >> >> >> > 1) Oppo discovered a bug in the following line: >> >> >> > + ci = si->cluster_info + tmp; >> >> >> > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". >> >> >> > That is a serious bug but trivial to fix. >> >> >> > >> >> >> > 2) order 0 allocation currently blindly scans swap_map disregarding >> >> >> > the cluster->order. >> >> >> >> >> >> IIUC, now, we only scan swap_map[] only if >> >> >> !list_empty(&si->free_clusters) && !list_empty(&si->nonfull_clusters[order]). >> >> >> That is, if you doesn't run low swap free space, you will not do that. >> >> > >> >> > You can still swap space in order 0 clusters while order 4 runs out of >> >> > free_cluster >> >> > or nonfull_clusters[order]. For Android that is a common case. >> >> >> >> When we fail to allocate order 4, we will fallback to order 0. Still >> >> don't need to scan swap_map[]. But after looking at your below reply, I >> >> realized that the swap space is almost full at most times in your cases. >> >> Then, it's possible that we run into scanning swap_map[]. >> >> list_empty(&si->free_clusters) && >> >> list_empty(&si->nonfull_clusters[order]) will become true, if we put too >> >> many clusters in si->percpu_cluster. So, if we want to avoid to scan >> >> swap_map[], we can stop add clusters in si->percpu_cluster when swap >> >> space runs low. And maybe take clusters out of si->percpu_cluster >> >> sometimes. >> > >> > One idea after reading your reply. If we run out of the >> > nonfull_cluster[order], we should be able to use other cpu's >> > si->percpu_cluster[] as well. That is a very small win for Android, >> >> This will be useful in general. The number CPU may be large, and >> multiple orders may be used. >> >> > because android does not have too many cpu. We are talking about a >> > handful of clusters, which might not justify the code complexity. It >> > does not change the behavior that order 0 can pollut higher order. >> >> I have a feeling that you don't really know why swap_map[] is scanned. >> I suggest you to do more test and tracing to find out the reason. I >> suspect that there are some non-full cluster collection issues. >> >> >> Another issue is nonfull_cluster[order1] cannot be used for >> >> nonfull_cluster[order2]. In definition, we should not fail order 0 >> >> allocation, we need to steal nonfull_cluster[order>0] for order 0 >> >> allocation. This can avoid to scan swap_map[] too. This may be not >> >> perfect, but it is the simplest first step implementation. You can >> >> optimize based on it further. 
>> > >> > Yes, that is listed as the limitation of this cluster order approach. >> > Initially we need to support one order well first. We might choose >> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages >> > are too big. The sweet spot might be some there in between. If we can >> > support one order well, we can demonstrate the value of the mTHP. We >> > can worry about other mix orders later. >> > >> > Do you have any suggestions for how to prevent the order 0 polluting >> > the higher order cluster? If we allow that to happen, then it defeats >> > the goal of being able to allocate higher order swap entries. The >> > tricky question is we don't know how much swap space we should reserve >> > for each order. We can always break higher order clusters to lower >> > order, but can't do the reserves. The current patch series lets the >> > actual usage determine the percentage of the cluster for each order. >> > However that seems not enough for the test case Barry has. When the >> > app gets OOM kill that is where a large swing of order 0 swap will >> > show up and not enough higher order usage for the brief moment. The >> > order 0 swap entry will pollute the high order cluster. We are >> > currently debating a "knob" to be able to reserve a certain % of swap >> > space for a certain order. Those reservations will be guaranteed and >> > order 0 swap entry can't pollute them even when it runs out of swap >> > space. That can make the mTHP at least usable for the Android case. >> >> IMO, the bottom line is that order-0 allocation is the first class >> citizen, we must keep it optimized. And, OOM with free swap space isn't >> acceptable. Please consider the policy we used for page allocation. >> >> > Do you see another way to protect the high order cluster polluted by >> > lower order one? >> >> If we use high-order page allocation as reference, we need something >> like compaction to guarantee high-order allocation finally. But we are >> too far from that. >> >> For specific configuration, I believe that we can get reasonable >> high-order swap entry allocation success rate for specific use cases. >> For example, if we only do limited maximum number order-0 swap entries >> allocation, can we keep high-order clusters? > > Isn't limiting order-0 allocation breaks the bottom line that order-0 > allocation is the first class citizen, and should not fail if there is > space? Sorry for confusing words. I mean limiting maximum number order-0 swap entries allocation in workloads, instead of limiting that in kernel. > Just my two cents... > > I had a try locally based on Chris's work, allowing order 0 to use > nonfull_clusters as Ying has suggested, and starting with low order > and increase the order until nonfull_cluster[order] is not empty, that > way higher order is just better protected, because unless we ran out > of free_cluster and nonfull_cluster, direct scan won't happen. > > More concretely, I applied the following changes, which didn't change > the code much: > - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then > free_clusters, then discard_cluster. > - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i) > nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster > returns false. > > A quick test still using the memtier test, but decreased the swap > device size from 10G to 8g for higher pressure. 
> > Before: > hugepages-32kB/stats/swpout:34013 > hugepages-32kB/stats/swpout_fallback:266 > hugepages-512kB/stats/swpout:0 > hugepages-512kB/stats/swpout_fallback:77 > hugepages-2048kB/stats/swpout:0 > hugepages-2048kB/stats/swpout_fallback:1 > hugepages-1024kB/stats/swpout:0 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:35088 > hugepages-64kB/stats/swpout_fallback:66 > hugepages-16kB/stats/swpout:31848 > hugepages-16kB/stats/swpout_fallback:402 > hugepages-256kB/stats/swpout:390 > hugepages-256kB/stats/swpout_fallback:7244 > hugepages-128kB/stats/swpout:28573 > hugepages-128kB/stats/swpout_fallback:474 > > After: > hugepages-32kB/stats/swpout:31448 > hugepages-32kB/stats/swpout_fallback:3354 > hugepages-512kB/stats/swpout:30 > hugepages-512kB/stats/swpout_fallback:33 > hugepages-2048kB/stats/swpout:2 > hugepages-2048kB/stats/swpout_fallback:0 > hugepages-1024kB/stats/swpout:0 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:31255 > hugepages-64kB/stats/swpout_fallback:3112 > hugepages-16kB/stats/swpout:29931 > hugepages-16kB/stats/swpout_fallback:3397 > hugepages-256kB/stats/swpout:5223 > hugepages-256kB/stats/swpout_fallback:2351 > hugepages-128kB/stats/swpout:25600 > hugepages-128kB/stats/swpout_fallback:2194 > > High order (256k) swapout rate are significantly higher, 512k is now > possible, which indicate high orders are better protected, lower > orders are sacrificed but seems worth it. Yes. I think that this reflects another aspect of the problem. In some situations, it's better to steal one high-order cluster and use it for order-0 allocation instead of scattering order-0 allocation in random high-order clusters. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
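A small sketch of the idea above, i.e. stealing one whole high-order cluster for order-0 use instead of scattering order-0 entries across many high-order clusters. The model is simplified and the helper is a stand-in; in the real allocator the stolen cluster would presumably become the CPU's order-0 percpu cluster:

#include <stdio.h>

struct cluster_model {
        int order;              /* order this cluster currently serves */
        int free_slots;         /* remaining free swap entries in it */
};

/*
 * Instead of placing stray order-0 entries into whatever high-order
 * cluster happens to have room (which fragments it for good), take one
 * high-order nonfull cluster, relabel it as order 0 and keep feeding
 * order-0 allocations from it until it is full.
 */
static struct cluster_model *steal_for_order0(struct cluster_model *nonfull[],
                                              int nr)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (nonfull[i] && nonfull[i]->order > 0) {
                        nonfull[i]->order = 0;  /* now dedicated to order 0 */
                        return nonfull[i];
                }
        }
        return NULL;    /* nothing to steal; caller falls back further */
}

int main(void)
{
        struct cluster_model c = { .order = 4, .free_slots = 320 };
        struct cluster_model *nonfull[] = { &c };
        struct cluster_model *victim = steal_for_order0(nonfull, 1);

        if (victim)
                printf("stole a cluster with %d free slots for order 0\n",
                       victim->free_slots);
        return 0;
}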
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-04 7:27 ` Huang, Ying @ 2024-06-05 7:40 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-06-05 7:40 UTC (permalink / raw) To: Huang, Ying Cc: Kairui Song, Andrew Morton, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Tue, Jun 4, 2024 at 12:29 AM Huang, Ying <ying.huang@intel.com> wrote: > > Kairui Song <ryncsn@gmail.com> writes: > > > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote: > > Isn't limiting order-0 allocation breaks the bottom line that order-0 > > allocation is the first class citizen, and should not fail if there is > > space? > > Sorry for confusing words. I mean limiting maximum number order-0 swap > entries allocation in workloads, instead of limiting that in kernel. What interface does it use to limit the order 0 swap entries? I was thinking the kernel would enforce the high order swap space reservation just like hugetlbfs does on huge pages. We will need to introduce some interface to specify the reservation. > > > Just my two cents... > > > > I had a try locally based on Chris's work, allowing order 0 to use > > nonfull_clusters as Ying has suggested, and starting with low order > > and increase the order until nonfull_cluster[order] is not empty, that > > way higher order is just better protected, because unless we ran out > > of free_cluster and nonfull_cluster, direct scan won't happen. > > > > More concretely, I applied the following changes, which didn't change > > the code much: > > - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then > > free_clusters, then discard_cluster. > > - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i) > > nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster > > returns false. > > > > A quick test still using the memtier test, but decreased the swap > > device size from 10G to 8g for higher pressure. > > > > Before: > > hugepages-32kB/stats/swpout:34013 > > hugepages-32kB/stats/swpout_fallback:266 > > hugepages-512kB/stats/swpout:0 > > hugepages-512kB/stats/swpout_fallback:77 > > hugepages-2048kB/stats/swpout:0 > > hugepages-2048kB/stats/swpout_fallback:1 > > hugepages-1024kB/stats/swpout:0 > > hugepages-1024kB/stats/swpout_fallback:0 > > hugepages-64kB/stats/swpout:35088 > > hugepages-64kB/stats/swpout_fallback:66 > > hugepages-16kB/stats/swpout:31848 > > hugepages-16kB/stats/swpout_fallback:402 > > hugepages-256kB/stats/swpout:390 > > hugepages-256kB/stats/swpout_fallback:7244 > > hugepages-128kB/stats/swpout:28573 > > hugepages-128kB/stats/swpout_fallback:474 > > > > After: > > hugepages-32kB/stats/swpout:31448 > > hugepages-32kB/stats/swpout_fallback:3354 > > hugepages-512kB/stats/swpout:30 > > hugepages-512kB/stats/swpout_fallback:33 > > hugepages-2048kB/stats/swpout:2 > > hugepages-2048kB/stats/swpout_fallback:0 > > hugepages-1024kB/stats/swpout:0 > > hugepages-1024kB/stats/swpout_fallback:0 > > hugepages-64kB/stats/swpout:31255 > > hugepages-64kB/stats/swpout_fallback:3112 > > hugepages-16kB/stats/swpout:29931 > > hugepages-16kB/stats/swpout_fallback:3397 > > hugepages-256kB/stats/swpout:5223 > > hugepages-256kB/stats/swpout_fallback:2351 > > hugepages-128kB/stats/swpout:25600 > > hugepages-128kB/stats/swpout_fallback:2194 > > > > High order (256k) swapout rate are significantly higher, 512k is now > > possible, which indicate high orders are better protected, lower > > orders are sacrificed but seems worth it. > > Yes. 
I think that this reflects another aspect of the problem. In some > situations, it's better to steal one high-order cluster and use it for > order-0 allocation instead of scattering order-0 allocation in random > high-order clusters. Agree, the scan loop on swap_map[] has the worst pollution. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-31 12:40 ` Kairui Song 2024-06-04 7:27 ` Huang, Ying @ 2024-06-05 7:30 ` Chris Li 1 sibling, 0 replies; 43+ messages in thread From: Chris Li @ 2024-06-05 7:30 UTC (permalink / raw) To: Kairui Song Cc: Huang, Ying, Andrew Morton, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Fri, May 31, 2024 at 5:40 AM Kairui Song <ryncsn@gmail.com> wrote: > > On Fri, May 31, 2024 at 10:37 AM Huang, Ying <ying.huang@intel.com> wrote: > > > > For specific configuration, I believe that we can get reasonable > > high-order swap entry allocation success rate for specific use cases. > > For example, if we only do limited maximum number order-0 swap entries > > allocation, can we keep high-order clusters? > > Isn't limiting order-0 allocation breaks the bottom line that order-0 > allocation is the first class citizen, and should not fail if there is > space? We need to have both high order and low order swap allocation working, and be able to recover from the swapfile-full case. > > Just my two cents... > > I had a try locally based on Chris's work, allowing order 0 to use > nonfull_clusters as Ying has suggested, and starting with low order > and increase the order until nonfull_cluster[order] is not empty, that > way higher order is just better protected, because unless we ran out > of free_cluster and nonfull_cluster, direct scan won't happen. That does not help the Android test case Barry is running, because Android tries to keep the swapfile full. It will hit the case where both the empty and nonfull lists are all used up when it performs the low memory kill. There will be a big change in the ratio of low vs high order swap. Allocating high order swap entries should be able to recover from that. > > More concretely, I applied the following changes, which didn't change > the code much: > - In scan_swap_map_try_ssd_cluster, check nonfull_cluster first, then > free_clusters, then discard_cluster. I did consider checking the nonfull list before the empty list. The current allocation tries to make the HAS_CACHE-only swap entries stay on the disk for a longer time before recycling them. If the folio is still in the swap cache and not dirty, it can skip the write out and directly reuse the swap slot during reclaim. I am not sure this code path is important now; it seems that when the swap slot is freed, it will remove HAS_CACHE as well. BTW, I noticed that the discard cluster path doesn't check if the swap cache has a folio pointing to it. After discarding, it just sets the swap_map to 0. I wonder if the swap cache could have a folio in that discarded slot that would hit the skip-writeback logic. If that is triggerable, it would be a corruption bug. The current SSD allocation also has some comments saying old SSDs can benefit from not writing to the same block too many times, to help the wear leveling. I don't think that is a big deal now; even cheap SD cards have wear leveling nowadays. > - If it's order 0, also check for (int i = 0; i < SWAP_NR_ORDERS; ++i) > nonfull_clusters[i] cluster before scan_swap_map_try_ssd_cluster > returns false. Ideally we would have some option to reserve some high order swap space so order 0 can't pollute the high order clusters. Chris > > A quick test still using the memtier test, but decreased the swap > device size from 10G to 8g for higher pressure.
> > Before: > hugepages-32kB/stats/swpout:34013 > hugepages-32kB/stats/swpout_fallback:266 > hugepages-512kB/stats/swpout:0 > hugepages-512kB/stats/swpout_fallback:77 > hugepages-2048kB/stats/swpout:0 > hugepages-2048kB/stats/swpout_fallback:1 > hugepages-1024kB/stats/swpout:0 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:35088 > hugepages-64kB/stats/swpout_fallback:66 > hugepages-16kB/stats/swpout:31848 > hugepages-16kB/stats/swpout_fallback:402 > hugepages-256kB/stats/swpout:390 > hugepages-256kB/stats/swpout_fallback:7244 > hugepages-128kB/stats/swpout:28573 > hugepages-128kB/stats/swpout_fallback:474 > > After: > hugepages-32kB/stats/swpout:31448 > hugepages-32kB/stats/swpout_fallback:3354 > hugepages-512kB/stats/swpout:30 > hugepages-512kB/stats/swpout_fallback:33 > hugepages-2048kB/stats/swpout:2 > hugepages-2048kB/stats/swpout_fallback:0 > hugepages-1024kB/stats/swpout:0 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:31255 > hugepages-64kB/stats/swpout_fallback:3112 > hugepages-16kB/stats/swpout:29931 > hugepages-16kB/stats/swpout_fallback:3397 > hugepages-256kB/stats/swpout:5223 > hugepages-256kB/stats/swpout_fallback:2351 > hugepages-128kB/stats/swpout:25600 > hugepages-128kB/stats/swpout_fallback:2194 > > High order (256k) swapout rate are significantly higher, 512k is now > possible, which indicate high orders are better protected, lower > orders are sacrificed but seems worth it. > ^ permalink raw reply [flat|nested] 43+ messages in thread
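A minimal sketch of the cluster preference order Kairui describes above, using toy stand-in structures rather than the real mm/swapfile.c code; only the names nonfull_clusters, free_clusters, discard and SWAP_NR_ORDERS mirror the discussion, everything else is invented for illustration:

#include <stddef.h>

#define SWAP_NR_ORDERS 10	/* toy value, stands in for the kernel's define */

struct toy_cluster {
	struct toy_cluster *next;
	int order;
};

struct toy_swap_info {
	struct toy_cluster *nonfull_clusters[SWAP_NR_ORDERS];
	struct toy_cluster *free_clusters;
	struct toy_cluster *discard_clusters;
};

static struct toy_cluster *pop(struct toy_cluster **list)
{
	struct toy_cluster *c = *list;

	if (c)
		*list = c->next;
	return c;
}

/* Preference: nonfull[order] -> free -> discard; order 0 may then steal
 * from any other order's non-full list instead of scanning swap_map[]. */
static struct toy_cluster *alloc_cluster(struct toy_swap_info *si, int order)
{
	struct toy_cluster *c;

	if ((c = pop(&si->nonfull_clusters[order])))
		return c;
	if ((c = pop(&si->free_clusters)))
		return c;
	if ((c = pop(&si->discard_clusters)))
		return c;
	if (order == 0) {
		for (int i = 1; i < SWAP_NR_ORDERS; i++)
			if ((c = pop(&si->nonfull_clusters[i])))
				return c;
	}
	return NULL;	/* caller falls back to the swap_map[] scan */
}

In the real allocator these lists sit under the swap device lock and the discard path has to wait for the discard work to finish, so the sketch only captures the ordering being debated, not the locking.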
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-31 2:35 ` Huang, Ying 2024-05-31 12:40 ` Kairui Song @ 2024-06-05 7:08 ` Chris Li 2024-06-06 1:55 ` Huang, Ying 1 sibling, 1 reply; 43+ messages in thread From: Chris Li @ 2024-06-05 7:08 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: > > because android does not have too many cpu. We are talking about a > > handful of clusters, which might not justify the code complexity. It > > does not change the behavior that order 0 can pollut higher order. > > I have a feeling that you don't really know why swap_map[] is scanned. > I suggest you to do more test and tracing to find out the reason. I > suspect that there are some non-full cluster collection issues. Swap_map[] is scanned because of running out of non full clusters. This can happen because Android tries to make full use of the swapfile. However, once the swap_map[] scan happens, the non full cluster is polluted. I currently don't have a local reproduction of the issue Barry reported. However here is some data point: Two swap files, one for high order allocation only with this patch. No fall back. If there is a non-full cluster collection issue, we should see the fall back in this case as well. BTW, same setup without this patch series it will fall back on the high order allocation as well. > > >> Another issue is nonfull_cluster[order1] cannot be used for > >> nonfull_cluster[order2]. In definition, we should not fail order 0 > >> allocation, we need to steal nonfull_cluster[order>0] for order 0 > >> allocation. This can avoid to scan swap_map[] too. This may be not > >> perfect, but it is the simplest first step implementation. You can > >> optimize based on it further. > > > > Yes, that is listed as the limitation of this cluster order approach. > > Initially we need to support one order well first. We might choose > > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages > > are too big. The sweet spot might be some there in between. If we can > > support one order well, we can demonstrate the value of the mTHP. We > > can worry about other mix orders later. > > > > Do you have any suggestions for how to prevent the order 0 polluting > > the higher order cluster? If we allow that to happen, then it defeats > > the goal of being able to allocate higher order swap entries. The > > tricky question is we don't know how much swap space we should reserve > > for each order. We can always break higher order clusters to lower > > order, but can't do the reserves. The current patch series lets the > > actual usage determine the percentage of the cluster for each order. > > However that seems not enough for the test case Barry has. When the > > app gets OOM kill that is where a large swing of order 0 swap will > > show up and not enough higher order usage for the brief moment. The > > order 0 swap entry will pollute the high order cluster. We are > > currently debating a "knob" to be able to reserve a certain % of swap > > space for a certain order. Those reservations will be guaranteed and > > order 0 swap entry can't pollute them even when it runs out of swap > > space. That can make the mTHP at least usable for the Android case. 
> > IMO, the bottom line is that order-0 allocation is the first class > citizen, we must keep it optimized. And, OOM with free swap space isn't > acceptable. Please consider the policy we used for page allocation. We need to make both order-0 and high order allocation work after the initial pass of allocating from empty clusters. Having only order-0 allocation work is not good enough. On the page allocation side, we have hugetlbfs, which reserves some memory for high order pages. We should have something similar to allow reserving some high order swap entries without getting them polluted by low order ones. > > > Do you see another way to protect the high order cluster polluted by > > lower order one? > > If we use high-order page allocation as reference, we need something > like compaction to guarantee high-order allocation finally. But we are > too far from that. We should consider reservation for high-order swap entry allocation, similar to hugetlbfs for memory. Swap compaction will be very complicated because it needs to scan the PTEs to migrate the swap entries. It might be easier to support folio write-out of compound discontiguous swap entries. That is another way to address the fragmentation issue. We are also too far from that right now. > > For specific configuration, I believe that we can get reasonable > high-order swap entry allocation success rate for specific use cases. > For example, if we only do limited maximum number order-0 swap entries > allocation, can we keep high-order clusters? Yes we can, by having a knob to reserve some high order swap space. Limiting order 0 is the same as having some high order swap entries reserved. That is a short term solution. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
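As a rough illustration of the kind of knob being debated here, a per-device reservation could be as simple as a threshold that keeps order-0 requests away from the last empty clusters. No such knob or field exists in the kernel today; the names below are hypothetical:

/* Hypothetical sketch of a "reserve high order swap space" knob.
 * reserved_hi_clusters would come from a sysfs or swapon setting;
 * the structure and field names are made up for illustration. */
struct toy_swap_device {
	long free_clusters;		/* empty clusters still available */
	long reserved_hi_clusters;	/* keep this many for order > 0 */
};

/* Order-0 may only take an empty cluster while the reserve is intact;
 * higher orders are always allowed to dip into the reserved pool. */
static int may_take_empty_cluster(struct toy_swap_device *sd, int order)
{
	if (order > 0)
		return sd->free_clusters > 0;
	return sd->free_clusters > sd->reserved_hi_clusters;
}

Once order 0 hits the reserve it would fall back to its own non-full clusters or the swap_map[] scan, which is exactly the trade-off raised in the thread about order-0 being the first class citizen.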
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-05 7:08 ` Chris Li @ 2024-06-06 1:55 ` Huang, Ying 2024-06-07 18:40 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-06-06 1:55 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Thu, May 30, 2024 at 7:37 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> > because android does not have too many cpu. We are talking about a >> > handful of clusters, which might not justify the code complexity. It >> > does not change the behavior that order 0 can pollut higher order. >> >> I have a feeling that you don't really know why swap_map[] is scanned. >> I suggest you to do more test and tracing to find out the reason. I >> suspect that there are some non-full cluster collection issues. > > Swap_map[] is scanned because of running out of non full clusters. > This can happen because Android tries to make full use of the swapfile. > However, once the swap_map[] scan happens, the non full cluster is polluted. > > I currently don't have a local reproduction of the issue Barry reported. > However here is some data point: > Two swap files, one for high order allocation only with this patch. No > fall back. > If there is a non-full cluster collection issue, we should see the > fall back in this case as well. > > BTW, same setup without this patch series it will fall back on the > high order allocation as well. > >> >> >> Another issue is nonfull_cluster[order1] cannot be used for >> >> nonfull_cluster[order2]. In definition, we should not fail order 0 >> >> allocation, we need to steal nonfull_cluster[order>0] for order 0 >> >> allocation. This can avoid to scan swap_map[] too. This may be not >> >> perfect, but it is the simplest first step implementation. You can >> >> optimize based on it further. >> > >> > Yes, that is listed as the limitation of this cluster order approach. >> > Initially we need to support one order well first. We might choose >> > what order that is, 16K or 64K folio. 4K pages are too small, 2M pages >> > are too big. The sweet spot might be some there in between. If we can >> > support one order well, we can demonstrate the value of the mTHP. We >> > can worry about other mix orders later. >> > >> > Do you have any suggestions for how to prevent the order 0 polluting >> > the higher order cluster? If we allow that to happen, then it defeats >> > the goal of being able to allocate higher order swap entries. The >> > tricky question is we don't know how much swap space we should reserve >> > for each order. We can always break higher order clusters to lower >> > order, but can't do the reserves. The current patch series lets the >> > actual usage determine the percentage of the cluster for each order. >> > However that seems not enough for the test case Barry has. When the >> > app gets OOM kill that is where a large swing of order 0 swap will >> > show up and not enough higher order usage for the brief moment. The >> > order 0 swap entry will pollute the high order cluster. We are >> > currently debating a "knob" to be able to reserve a certain % of swap >> > space for a certain order. Those reservations will be guaranteed and >> > order 0 swap entry can't pollute them even when it runs out of swap >> > space. 
That can make the mTHP at least usable for the Android case. >> >> IMO, the bottom line is that order-0 allocation is the first class >> citizen, we must keep it optimized. And, OOM with free swap space isn't >> acceptable. Please consider the policy we used for page allocation. > > We need to make order-0 and high order allocation both can work after > the initial pass of allocating from empty clusters. > Only order-0 allocation work is not good enough. > > In the page allocation side, we have the hugetlbfs which reserve some > memory for high order pages. > We should have similar things to allow reserve some high order swap > entries without getting polluted by low order one. TBH, I don't like the idea of high order swap entries reservation. If that's really important for you, I think that it's better to design something like hugetlbfs vs core mm, that is, be separated from the normal swap subsystem as much as possible. >> >> > Do you see another way to protect the high order cluster polluted by >> > lower order one? >> >> If we use high-order page allocation as reference, we need something >> like compaction to guarantee high-order allocation finally. But we are >> too far from that. > > We should consider reservation for high-order swap entry allocation > similar to hugetlbfs for memory. > Swap compaction will be very complicated because it needs to scan the > PTE to migrate the swap entry. It might be easier to support folio > write out compound discontiguous swap entries. That is another way to > address the fragmentation issue. We are also too far from that as > right now. That's not easy to write out compound discontiguous swap entries too. For example, how to put folios in swap cache? >> >> For specific configuration, I believe that we can get reasonable >> high-order swap entry allocation success rate for specific use cases. >> For example, if we only do limited maximum number order-0 swap entries >> allocation, can we keep high-order clusters? > > Yes we can. Having a knob to reserve some high order swap space. > Limiting order 0 is the same as having some high order swap entries > reserved. > > That is a short term solution. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-06 1:55 ` Huang, Ying @ 2024-06-07 18:40 ` Chris Li 2024-06-11 2:36 ` Huang, Ying 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-06-07 18:40 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > In the page allocation side, we have the hugetlbfs which reserve some > > memory for high order pages. > > We should have similar things to allow reserve some high order swap > > entries without getting polluted by low order one. > > TBH, I don't like the idea of high order swap entries reservation. May I know more about why you don't like the idea? I understand this can be controversial, because previously we liked to treat THP as a best effort approach. If there is some reason we can't make a THP, we use order 0 as the fallback. For discussion purposes, I want to break it down into smaller steps: First, can we agree that the following usage case is reasonable: the usage case is that, as Barry has shown, zsmalloc compressing sizes bigger than 4K can give both a better compression ratio and a CPU performance gain. https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ So the goal is to make THP/mTHP have some reasonable success rate when running with mixed size swap allocation, even after either low order or high order swap requests have overflowed the swap file size. The allocator can still recover from that after some swap entries get freed. Please let me know if you think the above usage case and goal are not reasonable for the kernel. > that's really important for you, I think that it's better to design > something like hugetlbfs vs core mm, that is, be separated from the > normal swap subsystem as much as possible. I am citing hugetlbfs just to make the point of using reservation, or isolation of the resource, to avoid the mixed-order fragmentation that exists in core mm. I am not suggesting copying the hugetlbfs implementation to the swap system. Unlike hugetlbfs, swap allocation is typically done from the kernel and is transparent to the application. I don't think separating it from the swap subsystem is a good way to go. This comes down to why you don't like the reservation. E.g. if we use two swapfiles, and one swapfile is purely allocated for high order, would that be better? > > >> > >> > Do you see another way to protect the high order cluster polluted by > >> > lower order one? > >> > >> If we use high-order page allocation as reference, we need something > >> like compaction to guarantee high-order allocation finally. But we are > >> too far from that. > > > > We should consider reservation for high-order swap entry allocation > > similar to hugetlbfs for memory. > > Swap compaction will be very complicated because it needs to scan the > > PTE to migrate the swap entry. It might be easier to support folio > > write out compound discontiguous swap entries. That is another way to > > address the fragmentation issue. We are also too far from that as > > right now. > > That's not easy to write out compound discontiguous swap entries too. > For example, how to put folios in swap cache? I proposed the idea in the recent LSF/MM discussion; the last few slides are about the discontiguous swap, and they cover the discontiguous entries in the swap cache.
https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view Agree it is not an easy change. The swap cache would have to change the assumption that all offsets are contiguous. For swap, we kind of have some in-memory data associated with each offset already, so it might provide an opportunity to combine the offset-related data structures for swap together. Another alternative might be using the xarray without the multi-entry property, just treating each offset like a single entry. I haven't dug deep into this direction yet. We can have more discussion, maybe arrange an upstream alignment meeting if there is interest. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
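A minimal sketch of that xarray direction, assuming each swap slot of a discontiguous folio simply gets its own entry pointing at the folio rather than one multi-index entry. The helper and the standalone xarray are hypothetical; the real swap cache lives in the swap address_space i_pages and needs its own locking and error unwinding:

#include <linux/xarray.h>
#include <linux/mm_types.h>

/* Store one entry per swap slot instead of one multi-index entry for
 * the whole (possibly discontiguous) range backing the folio. */
static int toy_add_to_swap_cache(struct xarray *swap_cache,
				 struct folio *folio,
				 const pgoff_t *slots, unsigned int nr)
{
	for (unsigned int i = 0; i < nr; i++) {
		void *old = xa_store(swap_cache, slots[i], folio, GFP_KERNEL);

		if (xa_is_err(old))
			return xa_err(old);	/* real code must undo the stores above */
	}
	return 0;
}

Lookups by offset then work as they do today; the cost is one store per slot and the loss of the multi-index storage optimization, which is part of why this direction still needs more digging.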
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-07 18:40 ` Chris Li @ 2024-06-11 2:36 ` Huang, Ying 2024-06-11 7:11 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-06-11 2:36 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> > >> > In the page allocation side, we have the hugetlbfs which reserve some >> > memory for high order pages. >> > We should have similar things to allow reserve some high order swap >> > entries without getting polluted by low order one. >> >> TBH, I don't like the idea of high order swap entries reservation. > May I know more if you don't like the idea? I understand this can be > controversial, because previously we like to take the THP as the best > effort approach. If there is some reason we can't make THP, we use the > order 0 as fall back. > > For discussion purpose, I want break it down to smaller steps: > > First, can we agree that the following usage case is reasonable: > The usage case is that, as Barry has shown, zsmalloc compresses bigger > size than 4K and can have both better compress ratio and CPU > performance gain. > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > > So the goal is to make THP/mTHP have some reasonable success rate > running in the mix size swap allocation, after either low order or > high order swap requests can overflow the swap file size. The allocate > can still recover from that, after some swap entries got free. > > Please let me know if you think the above usage case and goal are not > reasonable for the kernel. I think that it's reasonable to improve the success rate of high-order swap entries allocation. I just think that it's hard to use the reservation based method. For example, how much should be reserved? Why system OOM when there's still swap space available? And so forth. So, I prefer the transparent methods. Just like THP vs. hugetlbfs. >> that's really important for you, I think that it's better to design >> something like hugetlbfs vs core mm, that is, be separated from the >> normal swap subsystem as much as possible. > > I am giving hugetlbfs just to make the point using reservation, or > isolation of the resource to prevent mixing fragmentation existing in > core mm. > I am not suggesting copying the hugetlbfs implementation to the swap > system. Unlike hugetlbfs, the swap allocation is typically done from > the kernel, it is transparent from the application. I don't think > separate from the swap subsystem is a good way to go. > > This comes down to why you don't like the reservation. e.g. if we use > two swapfile, one swapfile is purely allocate for high order, would > that be better? Sorry, my words weren't accurate. Personally, I just think that it's better to make reservation related code not too intrusive. And, before reservation, we need to consider something else firstly. Whether is it generally good to swap-in with swap-out order? Should we consider memory wastage too? One static policy doesn't fit all, we may need either a dynamic policy, or make policy configurable. In general, I think that we need to do this step by step. >> >> > Do you see another way to protect the high order cluster polluted by >> >> > lower order one? 
>> >> >> >> If we use high-order page allocation as reference, we need something >> >> like compaction to guarantee high-order allocation finally. But we are >> >> too far from that. >> > >> > We should consider reservation for high-order swap entry allocation >> > similar to hugetlbfs for memory. >> > Swap compaction will be very complicated because it needs to scan the >> > PTE to migrate the swap entry. It might be easier to support folio >> > write out compound discontiguous swap entries. That is another way to >> > address the fragmentation issue. We are also too far from that as >> > right now. >> >> That's not easy to write out compound discontiguous swap entries too. >> For example, how to put folios in swap cache? > > I propose the idea in the recent LSF/MM discussion, the last few > slides are for the discontiguous swap and it has the discontiguous > entries in swap cache. > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view > > Agree it is not an easy change. The cache cache would have to change > the assumption all offset are contiguous. > For swap, we kind of have some in memory data associated with per > offset already, so it might provide an opportunity to combine the > offset related data structure for swap together. Another alternative > might be using xarray without the multi entry property. , just treat > each offset like a single entry. I haven't dug deep into this > direction yet. Thanks! I will study your idea. > We can have more discussion, maybe arrange an upstream alignment > meeting if there is interest. Sure. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-11 2:36 ` Huang, Ying @ 2024-06-11 7:11 ` Chris Li 2024-06-13 8:38 ` Huang, Ying 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-06-11 7:11 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > > > >> > In the page allocation side, we have the hugetlbfs which reserve some > >> > memory for high order pages. > >> > We should have similar things to allow reserve some high order swap > >> > entries without getting polluted by low order one. > >> > >> TBH, I don't like the idea of high order swap entries reservation. > > May I know more if you don't like the idea? I understand this can be > > controversial, because previously we like to take the THP as the best > > effort approach. If there is some reason we can't make THP, we use the > > order 0 as fall back. > > > > For discussion purpose, I want break it down to smaller steps: > > > > First, can we agree that the following usage case is reasonable: > > The usage case is that, as Barry has shown, zsmalloc compresses bigger > > size than 4K and can have both better compress ratio and CPU > > performance gain. > > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > > > > So the goal is to make THP/mTHP have some reasonable success rate > > running in the mix size swap allocation, after either low order or > > high order swap requests can overflow the swap file size. The allocate > > can still recover from that, after some swap entries got free. > > > > Please let me know if you think the above usage case and goal are not > > reasonable for the kernel. > > I think that it's reasonable to improve the success rate of high-order Glad to hear that. > swap entries allocation. I just think that it's hard to use the > reservation based method. For example, how much should be reserved? Understand, it is harder to use than a fully transparent method, but still better than no solution at all. The alternative right now is we can't do it. Regarding how much we should reserve. Similarly, how much should you choose your swap file size? If you choose N, why not N*120% or N*80%? That did not stop us from having a swapfile, right? > Why system OOM when there's still swap space available? And so forth. Keep in mind that the reservation is an option. If you prefer the old behavior, you don't have to use the reservation. That shouldn't be a reason to stop others who want to use it. We don't have an alternative solution for the long run mix size allocation yet. If there is, I like to hear it. > So, I prefer the transparent methods. Just like THP vs. hugetlbfs. Me too. I prefer transparent over reservation if it can achieve the same goal. Do we have a fully transparent method spec out? How to achieve fully transparent and also avoid fragmentation caused by mix order allocation/free? Keep in mind that we are still in the early stage of the mTHP swap development, I can have the reservation patch relatively easily. If you come up with a better transparent method patch which can achieve the same goal later, we can use it instead. 
> > >> that's really important for you, I think that it's better to design > >> something like hugetlbfs vs core mm, that is, be separated from the > >> normal swap subsystem as much as possible. > > > > I am giving hugetlbfs just to make the point using reservation, or > > isolation of the resource to prevent mixing fragmentation existing in > > core mm. > > I am not suggesting copying the hugetlbfs implementation to the swap > > system. Unlike hugetlbfs, the swap allocation is typically done from > > the kernel, it is transparent from the application. I don't think > > separate from the swap subsystem is a good way to go. > > > > This comes down to why you don't like the reservation. e.g. if we use > > two swapfile, one swapfile is purely allocate for high order, would > > that be better? > > Sorry, my words weren't accurate. Personally, I just think that it's > better to make reservation related code not too intrusive. Yes. I will try to make it not too intrusive. > And, before reservation, we need to consider something else firstly. > Whether is it generally good to swap-in with swap-out order? Should we When we have the reservation patch (or other means to sustain mix size swap allocation/free), we can test it out to get more data to reason about it. I consider the swap in size policy an orthogonal issue. > consider memory wastage too? One static policy doesn't fit all, we may > need either a dynamic policy, or make policy configurable. > In general, I think that we need to do this step by step. The core swap layer needs to be able to sustain mix size swap allocation free in the long run. Without that the swap in size policy is meaningless. Yes, that is the step by step approach. Allowing long run mix size swap allocation as the first step. > >> >> > Do you see another way to protect the high order cluster polluted by > >> >> > lower order one? > >> >> > >> >> If we use high-order page allocation as reference, we need something > >> >> like compaction to guarantee high-order allocation finally. But we are > >> >> too far from that. > >> > > >> > We should consider reservation for high-order swap entry allocation > >> > similar to hugetlbfs for memory. > >> > Swap compaction will be very complicated because it needs to scan the > >> > PTE to migrate the swap entry. It might be easier to support folio > >> > write out compound discontiguous swap entries. That is another way to > >> > address the fragmentation issue. We are also too far from that as > >> > right now. > >> > >> That's not easy to write out compound discontiguous swap entries too. > >> For example, how to put folios in swap cache? > > > > I propose the idea in the recent LSF/MM discussion, the last few > > slides are for the discontiguous swap and it has the discontiguous > > entries in swap cache. > > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view > > > > Agree it is not an easy change. The cache cache would have to change > > the assumption all offset are contiguous. > > For swap, we kind of have some in memory data associated with per > > offset already, so it might provide an opportunity to combine the > > offset related data structure for swap together. Another alternative > > might be using xarray without the multi entry property. , just treat > > each offset like a single entry. I haven't dug deep into this > > direction yet. > > Thanks! I will study your idea. > I am happy to discuss if you have any questions. 
> > We can have more discussion, maybe arrange an upstream alignment > > meeting if there is interest. > > Sure. Ideally, if we can resolve our differences over the mailing list, then we don't need to have a separate meeting :-) Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-11 7:11 ` Chris Li @ 2024-06-13 8:38 ` Huang, Ying 2024-06-18 4:35 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-06-13 8:38 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> > >> >> > In the page allocation side, we have the hugetlbfs which reserve some >> >> > memory for high order pages. >> >> > We should have similar things to allow reserve some high order swap >> >> > entries without getting polluted by low order one. >> >> >> >> TBH, I don't like the idea of high order swap entries reservation. >> > May I know more if you don't like the idea? I understand this can be >> > controversial, because previously we like to take the THP as the best >> > effort approach. If there is some reason we can't make THP, we use the >> > order 0 as fall back. >> > >> > For discussion purpose, I want break it down to smaller steps: >> > >> > First, can we agree that the following usage case is reasonable: >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger >> > size than 4K and can have both better compress ratio and CPU >> > performance gain. >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ >> > >> > So the goal is to make THP/mTHP have some reasonable success rate >> > running in the mix size swap allocation, after either low order or >> > high order swap requests can overflow the swap file size. The allocate >> > can still recover from that, after some swap entries got free. >> > >> > Please let me know if you think the above usage case and goal are not >> > reasonable for the kernel. >> >> I think that it's reasonable to improve the success rate of high-order > > Glad to hear that. > >> swap entries allocation. I just think that it's hard to use the >> reservation based method. For example, how much should be reserved? > > Understand, it is harder to use than a fully transparent method, but > still better than no solution at all. The alternative right now is we > can't do it. > > Regarding how much we should reserve. Similarly, how much should you > choose your swap file size? If you choose N, why not N*120% or N*80%? > That did not stop us from having a swapfile, right? > >> Why system OOM when there's still swap space available? And so forth. > > Keep in mind that the reservation is an option. If you prefer the old > behavior, you don't have to use the reservation. That shouldn't be a > reason to stop others who want to use it. We don't have an alternative > solution for the long run mix size allocation yet. If there is, I like > to hear it. It's not enough to make it optional. When you run into issue, you need to debug it. And you may debug an issue on a system that is configured by someone else. >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. > > Me too. I prefer transparent over reservation if it can achieve the > same goal. Do we have a fully transparent method spec out? How to > achieve fully transparent and also avoid fragmentation caused by mix > order allocation/free? 
> > Keep in mind that we are still in the early stage of the mTHP swap > development, I can have the reservation patch relatively easily. If > you come up with a better transparent method patch which can achieve > the same goal later, we can use it instead. Because we are still in the early stage, I think that we should try to improve transparent solution firstly. Personally, what I don't like is that we don't work on the transparent solution because we have the reservation solution. >> >> >> that's really important for you, I think that it's better to design >> >> something like hugetlbfs vs core mm, that is, be separated from the >> >> normal swap subsystem as much as possible. >> > >> > I am giving hugetlbfs just to make the point using reservation, or >> > isolation of the resource to prevent mixing fragmentation existing in >> > core mm. >> > I am not suggesting copying the hugetlbfs implementation to the swap >> > system. Unlike hugetlbfs, the swap allocation is typically done from >> > the kernel, it is transparent from the application. I don't think >> > separate from the swap subsystem is a good way to go. >> > >> > This comes down to why you don't like the reservation. e.g. if we use >> > two swapfile, one swapfile is purely allocate for high order, would >> > that be better? >> >> Sorry, my words weren't accurate. Personally, I just think that it's >> better to make reservation related code not too intrusive. > > Yes. I will try to make it not too intrusive. > >> And, before reservation, we need to consider something else firstly. >> Whether is it generally good to swap-in with swap-out order? Should we > > When we have the reservation patch (or other means to sustain mix size > swap allocation/free), we can test it out to get more data to reason > about it. > I consider the swap in size policy an orthogonal issue. No. I don't think so. If you swap-out in higher order, but swap-in in lower order, you make the swap clusters fragmented. >> consider memory wastage too? One static policy doesn't fit all, we may >> need either a dynamic policy, or make policy configurable. >> In general, I think that we need to do this step by step. > > The core swap layer needs to be able to sustain mix size swap > allocation free in the long run. Without that the swap in size policy > is meaningless. > > Yes, that is the step by step approach. Allowing long run mix size > swap allocation as the first step. > >> >> >> > Do you see another way to protect the high order cluster polluted by >> >> >> > lower order one? >> >> >> >> >> >> If we use high-order page allocation as reference, we need something >> >> >> like compaction to guarantee high-order allocation finally. But we are >> >> >> too far from that. >> >> > >> >> > We should consider reservation for high-order swap entry allocation >> >> > similar to hugetlbfs for memory. >> >> > Swap compaction will be very complicated because it needs to scan the >> >> > PTE to migrate the swap entry. It might be easier to support folio >> >> > write out compound discontiguous swap entries. That is another way to >> >> > address the fragmentation issue. We are also too far from that as >> >> > right now. >> >> >> >> That's not easy to write out compound discontiguous swap entries too. >> >> For example, how to put folios in swap cache? >> > >> > I propose the idea in the recent LSF/MM discussion, the last few >> > slides are for the discontiguous swap and it has the discontiguous >> > entries in swap cache. 
>> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view >> > >> > Agree it is not an easy change. The cache cache would have to change >> > the assumption all offset are contiguous. >> > For swap, we kind of have some in memory data associated with per >> > offset already, so it might provide an opportunity to combine the >> > offset related data structure for swap together. Another alternative >> > might be using xarray without the multi entry property. , just treat >> > each offset like a single entry. I haven't dug deep into this >> > direction yet. >> >> Thanks! I will study your idea. >> > > I am happy to discuss if you have any questions. > >> > We can have more discussion, maybe arrange an upstream alignment >> > meeting if there is interest. >> >> Sure. > > Ideally, if we can resolve our differences over the mail list then we > don't need to have a separate meeting :-) > -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-13 8:38 ` Huang, Ying @ 2024-06-18 4:35 ` Chris Li 2024-06-18 6:54 ` Huang, Ying 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-06-18 4:35 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Chris Li <chrisl@kernel.org> writes: > >> > >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Chris Li <chrisl@kernel.org> writes: > >> >> > >> > > >> >> > In the page allocation side, we have the hugetlbfs which reserve some > >> >> > memory for high order pages. > >> >> > We should have similar things to allow reserve some high order swap > >> >> > entries without getting polluted by low order one. > >> >> > >> >> TBH, I don't like the idea of high order swap entries reservation. > >> > May I know more if you don't like the idea? I understand this can be > >> > controversial, because previously we like to take the THP as the best > >> > effort approach. If there is some reason we can't make THP, we use the > >> > order 0 as fall back. > >> > > >> > For discussion purpose, I want break it down to smaller steps: > >> > > >> > First, can we agree that the following usage case is reasonable: > >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger > >> > size than 4K and can have both better compress ratio and CPU > >> > performance gain. > >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > >> > > >> > So the goal is to make THP/mTHP have some reasonable success rate > >> > running in the mix size swap allocation, after either low order or > >> > high order swap requests can overflow the swap file size. The allocate > >> > can still recover from that, after some swap entries got free. > >> > > >> > Please let me know if you think the above usage case and goal are not > >> > reasonable for the kernel. > >> > >> I think that it's reasonable to improve the success rate of high-order > > > > Glad to hear that. > > > >> swap entries allocation. I just think that it's hard to use the > >> reservation based method. For example, how much should be reserved? > > > > Understand, it is harder to use than a fully transparent method, but > > still better than no solution at all. The alternative right now is we > > can't do it. > > > > Regarding how much we should reserve. Similarly, how much should you > > choose your swap file size? If you choose N, why not N*120% or N*80%? > > That did not stop us from having a swapfile, right? > > > >> Why system OOM when there's still swap space available? And so forth. > > > > Keep in mind that the reservation is an option. If you prefer the old > > behavior, you don't have to use the reservation. That shouldn't be a > > reason to stop others who want to use it. We don't have an alternative > > solution for the long run mix size allocation yet. If there is, I like > > to hear it. > > It's not enough to make it optional. When you run into issue, you need > to debug it. And you may debug an issue on a system that is configured > by someone else. That is in general true with all kernel development regardless of using options or not. 
If there is a bug in my patch, I will need to debug and fix it or the patch might be reverted. I don't see that as a reason to take the option path or not. The option just means the user taking this option will need to understand the trade off and accept the defined behavior of that option. > > >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. > > > > Me too. I prefer transparent over reservation if it can achieve the > > same goal. Do we have a fully transparent method spec out? How to > > achieve fully transparent and also avoid fragmentation caused by mix > > order allocation/free? > > > > Keep in mind that we are still in the early stage of the mTHP swap > > development, I can have the reservation patch relatively easily. If > > you come up with a better transparent method patch which can achieve > > the same goal later, we can use it instead. > > Because we are still in the early stage, I think that we should try to > improve transparent solution firstly. Personally, what I don't like is > that we don't work on the transparent solution because we have the > reservation solution. Do you have a road map or the design for the transparent solution you can share? I am interested to know what is the short term step(e.g. a month) in this transparent solution you have in mind, so we can compare the different approaches. I can't reason much just by the name "transparent solution" itself. Need more technical details. Right now we have a clear usage case we want to support, the swap in/out mTHP with bigger zsmalloc buffers. We can start with the limited usage case first then move to more general ones. > >> >> that's really important for you, I think that it's better to design > >> >> something like hugetlbfs vs core mm, that is, be separated from the > >> >> normal swap subsystem as much as possible. > >> > > >> > I am giving hugetlbfs just to make the point using reservation, or > >> > isolation of the resource to prevent mixing fragmentation existing in > >> > core mm. > >> > I am not suggesting copying the hugetlbfs implementation to the swap > >> > system. Unlike hugetlbfs, the swap allocation is typically done from > >> > the kernel, it is transparent from the application. I don't think > >> > separate from the swap subsystem is a good way to go. > >> > > >> > This comes down to why you don't like the reservation. e.g. if we use > >> > two swapfile, one swapfile is purely allocate for high order, would > >> > that be better? > >> > >> Sorry, my words weren't accurate. Personally, I just think that it's > >> better to make reservation related code not too intrusive. > > > > Yes. I will try to make it not too intrusive. > > > >> And, before reservation, we need to consider something else firstly. > >> Whether is it generally good to swap-in with swap-out order? Should we > > > > When we have the reservation patch (or other means to sustain mix size > > swap allocation/free), we can test it out to get more data to reason > > about it. > > I consider the swap in size policy an orthogonal issue. > > No. I don't think so. If you swap-out in higher order, but swap-in in > lower order, you make the swap clusters fragmented. Sounds like that is the reason to apply swap-in the same order of the swap out. In any case, my original point still stands. We need to have the ability to allocate high order swap entries with reasonable success rate *before* we have the option to choose which size to swap in. 
If allocating a high order swap entry always fails, we will be forced to use the low order one; there is no option to choose from. We can't evaluate "is it generally good to swap-in with swap-out order?" by actual runs. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-18 4:35 ` Chris Li @ 2024-06-18 6:54 ` Huang, Ying 2024-06-18 9:31 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Huang, Ying @ 2024-06-18 6:54 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Thu, Jun 13, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Chris Li <chrisl@kernel.org> writes: >> >> >> >> >> > >> >> >> > In the page allocation side, we have the hugetlbfs which reserve some >> >> >> > memory for high order pages. >> >> >> > We should have similar things to allow reserve some high order swap >> >> >> > entries without getting polluted by low order one. >> >> >> >> >> >> TBH, I don't like the idea of high order swap entries reservation. >> >> > May I know more if you don't like the idea? I understand this can be >> >> > controversial, because previously we like to take the THP as the best >> >> > effort approach. If there is some reason we can't make THP, we use the >> >> > order 0 as fall back. >> >> > >> >> > For discussion purpose, I want break it down to smaller steps: >> >> > >> >> > First, can we agree that the following usage case is reasonable: >> >> > The usage case is that, as Barry has shown, zsmalloc compresses bigger >> >> > size than 4K and can have both better compress ratio and CPU >> >> > performance gain. >> >> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ >> >> > >> >> > So the goal is to make THP/mTHP have some reasonable success rate >> >> > running in the mix size swap allocation, after either low order or >> >> > high order swap requests can overflow the swap file size. The allocate >> >> > can still recover from that, after some swap entries got free. >> >> > >> >> > Please let me know if you think the above usage case and goal are not >> >> > reasonable for the kernel. >> >> >> >> I think that it's reasonable to improve the success rate of high-order >> > >> > Glad to hear that. >> > >> >> swap entries allocation. I just think that it's hard to use the >> >> reservation based method. For example, how much should be reserved? >> > >> > Understand, it is harder to use than a fully transparent method, but >> > still better than no solution at all. The alternative right now is we >> > can't do it. >> > >> > Regarding how much we should reserve. Similarly, how much should you >> > choose your swap file size? If you choose N, why not N*120% or N*80%? >> > That did not stop us from having a swapfile, right? >> > >> >> Why system OOM when there's still swap space available? And so forth. >> > >> > Keep in mind that the reservation is an option. If you prefer the old >> > behavior, you don't have to use the reservation. That shouldn't be a >> > reason to stop others who want to use it. We don't have an alternative >> > solution for the long run mix size allocation yet. If there is, I like >> > to hear it. >> >> It's not enough to make it optional. When you run into issue, you need >> to debug it. And you may debug an issue on a system that is configured >> by someone else. 
> > That is in general true with all kernel development regardless of > using options or not. If there is a bug in my patch, I will need to > debug and fix it or the patch might be reverted. > > I don't see that as a reason to take the option path or not. The > option just means the user taking this option will need to understand > the trade off and accept the defined behavior of that option. User configuration knobs are not forbidden for Linux kernel. But we are more careful about them because they will introduce ABI which we need to maintain forever. And they are hard to be used for users. Optimizing automatically is generally the better solution. So, I suggest you to think more about the automatically solution before diving into a new option. >> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. >> > >> > Me too. I prefer transparent over reservation if it can achieve the >> > same goal. Do we have a fully transparent method spec out? How to >> > achieve fully transparent and also avoid fragmentation caused by mix >> > order allocation/free? >> > >> > Keep in mind that we are still in the early stage of the mTHP swap >> > development, I can have the reservation patch relatively easily. If >> > you come up with a better transparent method patch which can achieve >> > the same goal later, we can use it instead. >> >> Because we are still in the early stage, I think that we should try to >> improve transparent solution firstly. Personally, what I don't like is >> that we don't work on the transparent solution because we have the >> reservation solution. > > Do you have a road map or the design for the transparent solution you can share? > I am interested to know what is the short term step(e.g. a month) in > this transparent solution you have in mind, so we can compare the > different approaches. I can't reason much just by the name > "transparent solution" itself. Need more technical details. > > Right now we have a clear usage case we want to support, the swap > in/out mTHP with bigger zsmalloc buffers. We can start with the > limited usage case first then move to more general ones. TBH, This is what I don't like. It appears that you refuse to think about the transparent (or automatic) solution. I haven't thought about them thoroughly, but at least we may think about - promoting low order non-full cluster when we find a free high order swap entries. - stealing a low order non-full cluster with low usage count for high-order allocation. - freeing more swap entries when swap devices become fragmented. >> >> >> that's really important for you, I think that it's better to design >> >> >> something like hugetlbfs vs core mm, that is, be separated from the >> >> >> normal swap subsystem as much as possible. >> >> > >> >> > I am giving hugetlbfs just to make the point using reservation, or >> >> > isolation of the resource to prevent mixing fragmentation existing in >> >> > core mm. >> >> > I am not suggesting copying the hugetlbfs implementation to the swap >> >> > system. Unlike hugetlbfs, the swap allocation is typically done from >> >> > the kernel, it is transparent from the application. I don't think >> >> > separate from the swap subsystem is a good way to go. >> >> > >> >> > This comes down to why you don't like the reservation. e.g. if we use >> >> > two swapfile, one swapfile is purely allocate for high order, would >> >> > that be better? >> >> >> >> Sorry, my words weren't accurate. 
Personally, I just think that it's >> >> better to make reservation related code not too intrusive. >> > >> > Yes. I will try to make it not too intrusive. >> > >> >> And, before reservation, we need to consider something else firstly. >> >> Whether is it generally good to swap-in with swap-out order? Should we >> > >> > When we have the reservation patch (or other means to sustain mix size >> > swap allocation/free), we can test it out to get more data to reason >> > about it. >> > I consider the swap in size policy an orthogonal issue. >> >> No. I don't think so. If you swap-out in higher order, but swap-in in >> lower order, you make the swap clusters fragmented. > > Sounds like that is the reason to apply swap-in the same order of the swap out. > In any case, my original point still stands. We need to have the > ability to allocate high order swap entries with reasonable success > rate *before* we have the option to choose which size to swap in. If > allocating a high order swap always fails, we will be forced to use > the low order one, there is no option to choose from. We can't evalute > "is it generally good to swap-in with swap-out order?" by actual runs. I think we don't need to fight for that. Just prove the value of your patchset with reasonable use cases and normal workloads. Data will persuade people. -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
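A self-contained sketch of the second idea above (stealing a lightly used, low order non-full cluster for a high-order allocation), using toy structures; the real cluster bookkeeping in mm/swapfile.c is different, and the names and thresholds below are invented for illustration:

#define TOY_CLUSTER_SLOTS 512	/* toy value; a real cluster has SWAPFILE_CLUSTER slots */

struct toy_cluster {
	unsigned char map[TOY_CLUSTER_SLOTS];	/* 0 == free slot */
	unsigned int count;			/* slots currently in use */
	int order;				/* order the cluster was assigned */
};

/* Try to carve an order-aligned run of free slots out of a sparsely
 * used low-order cluster; return the first slot of the run, or -1 if
 * the cluster is too fragmented and the caller should try another. */
static int steal_cluster_for_order(struct toy_cluster *c, int order)
{
	unsigned int nr = 1u << order;

	if (c->count > TOY_CLUSTER_SLOTS / 4)	/* only steal lightly used clusters */
		return -1;

	for (unsigned int base = 0; base + nr <= TOY_CLUSTER_SLOTS; base += nr) {
		unsigned int i;

		for (i = 0; i < nr; i++)
			if (c->map[base + i])
				break;
		if (i == nr) {
			c->order = order;	/* the cluster changes order */
			return (int)base;
		}
	}
	return -1;
}

Promoting a low order non-full cluster (the first idea) would be essentially the same aligned scan run opportunistically as entries are freed, without the usage-count cut-off.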
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-18 6:54 ` Huang, Ying @ 2024-06-18 9:31 ` Chris Li 2024-06-19 9:21 ` Huang, Ying 0 siblings, 1 reply; 43+ messages in thread From: Chris Li @ 2024-06-18 9:31 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@intel.com> wrote: > > Chris Li <chrisl@kernel.org> writes: > > > That is in general true with all kernel development regardless of > > using options or not. If there is a bug in my patch, I will need to > > debug and fix it or the patch might be reverted. > > > > I don't see that as a reason to take the option path or not. The > > option just means the user taking this option will need to understand > > the trade off and accept the defined behavior of that option. > > User configuration knobs are not forbidden for Linux kernel. But we are > more careful about them because they will introduce ABI which we need to > maintain forever. And they are hard to be used for users. Optimizing > automatically is generally the better solution. So, I suggest you to > think more about the automatically solution before diving into a new > option. I did, see my reply. Right now there are just no other options. > > >> > >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. > >> > > >> > Me too. I prefer transparent over reservation if it can achieve the > >> > same goal. Do we have a fully transparent method spec out? How to > >> > achieve fully transparent and also avoid fragmentation caused by mix > >> > order allocation/free? > >> > > >> > Keep in mind that we are still in the early stage of the mTHP swap > >> > development, I can have the reservation patch relatively easily. If > >> > you come up with a better transparent method patch which can achieve > >> > the same goal later, we can use it instead. > >> > >> Because we are still in the early stage, I think that we should try to > >> improve transparent solution firstly. Personally, what I don't like is > >> that we don't work on the transparent solution because we have the > >> reservation solution. > > > > Do you have a road map or the design for the transparent solution you can share? > > I am interested to know what is the short term step(e.g. a month) in > > this transparent solution you have in mind, so we can compare the > > different approaches. I can't reason much just by the name > > "transparent solution" itself. Need more technical details. > > > > Right now we have a clear usage case we want to support, the swap > > in/out mTHP with bigger zsmalloc buffers. We can start with the > > limited usage case first then move to more general ones. > > TBH, This is what I don't like. It appears that you refuse to think > about the transparent (or automatic) solution. Actually, that is not true, you make the wrong assumption about what I have considered. I want to find out what you have in mind to compare the near term solutions. In my recent LSF slide I already list 3 options to address this fragmentation problem. From easy to hard: 1) Assign cluster an order on allocation and remember the cluster order. (short term). That is this patch series 2) Buddy allocation on the swap entry (longer term) 3) Folio write out compound discontinuous swap entry. (ultimate) I also considered 4), which I did not put into the slide, because it is less effective than 3) 4) migrating the swap entries, which require scan page table entry. 
I briefly mentioned it during the session. 3) might qualify as your transparent solution. It is just much harder to implement. Even when we have 3), having some form of 1) can be beneficial as well (lower IO count, no indirection layer for the swap offset). > > I haven't thought about them thoroughly, but at least we may think about > > - promoting low order non-full cluster when we find a free high order > swap entries. > > - stealing a low order non-full cluster with low usage count for > high-order allocation. Now we are talking. These two above fall well within 2), the buddy allocator. But the buddy allocator will not be able to address all fragmentation issues, because the allocator does not control the life cycle of the swap entries. It will not help Barry's zsmalloc usage case much because Android likes to keep the swapfile full. I can already see that. > - freeing more swap entries when swap devices become fragmented. That requires scanning the page tables to free the swap entries, basically 4). It is all about investment and return. 1) is relatively easy to implement and comes with a good improvement and return. Chris > >> >> >> that's really important for you, I think that it's better to design > >> >> >> something like hugetlbfs vs core mm, that is, be separated from the > >> >> >> normal swap subsystem as much as possible. > >> >> > > >> >> > I am giving hugetlbfs just to make the point using reservation, or > >> >> > isolation of the resource to prevent mixing fragmentation existing in > >> >> > core mm. > >> >> > I am not suggesting copying the hugetlbfs implementation to the swap > >> >> > system. Unlike hugetlbfs, the swap allocation is typically done from > >> >> > the kernel, it is transparent from the application. I don't think > >> >> > separate from the swap subsystem is a good way to go. > >> >> > > >> >> > This comes down to why you don't like the reservation. e.g. if we use > >> >> > two swapfile, one swapfile is purely allocate for high order, would > >> >> > that be better? > >> >> > >> >> Sorry, my words weren't accurate. Personally, I just think that it's > >> >> better to make reservation related code not too intrusive. > >> > > >> > Yes. I will try to make it not too intrusive. > >> > > >> >> And, before reservation, we need to consider something else firstly. > >> >> Whether is it generally good to swap-in with swap-out order? Should we > >> > > >> > When we have the reservation patch (or other means to sustain mix size > >> > swap allocation/free), we can test it out to get more data to reason > >> > about it. > >> > I consider the swap in size policy an orthogonal issue. > >> > >> No. I don't think so. If you swap-out in higher order, but swap-in in > >> lower order, you make the swap clusters fragmented. > > > > Sounds like that is the reason to apply swap-in the same order of the swap out. > > In any case, my original point still stands. We need to have the > > ability to allocate high order swap entries with reasonable success > > rate *before* we have the option to choose which size to swap in. If > > allocating a high order swap always fails, we will be forced to use > > the low order one, there is no option to choose from. We can't evalute > > "is it generally good to swap-in with swap-out order?" by actual runs. > > I think we don't need to fight for that. Just prove the value of your > patchset with reasonable use cases and normal workloads. Data will > persuade people. ^ permalink raw reply [flat|nested] 43+ messages in thread
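For reference, option 2) above would look roughly like a classic buddy split applied to runs of swap slots: keep per-order free lists and split a larger run when the requested order's list is empty. This is a toy, userspace-style sketch under that assumption; no such swap-entry buddy allocator exists in the kernel today:

#include <stdlib.h>

#define TOY_MAX_ORDER 9

struct toy_run {
	struct toy_run *next;
	unsigned long first_slot;
};

static struct toy_run *free_runs[TOY_MAX_ORDER + 1];

static void toy_add_free_run(int order, unsigned long first_slot)
{
	struct toy_run *r = malloc(sizeof(*r));

	if (!r)
		return;		/* toy code: ignore allocation failure */
	r->first_slot = first_slot;
	r->next = free_runs[order];
	free_runs[order] = r;
}

/* Allocate 1 << order contiguous swap slots, splitting a bigger free
 * run buddy-style when the exact order list is empty. */
static long toy_alloc_slots(int order)
{
	for (int o = order; o <= TOY_MAX_ORDER; o++) {
		struct toy_run *r = free_runs[o];
		unsigned long first;

		if (!r)
			continue;
		free_runs[o] = r->next;
		first = r->first_slot;
		free(r);
		while (o > order) {
			o--;
			/* hand the upper buddy back to the smaller list */
			toy_add_free_run(o, first + (1UL << o));
		}
		return (long)first;
	}
	return -1;	/* no run large enough; the device is fragmented */
}

The missing half, merging freed buddies back into larger runs, is where it gets hard, since the allocator does not control when individual swap entries are freed, which is the limitation pointed out above.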
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-18 9:31 ` Chris Li @ 2024-06-19 9:21 ` Huang, Ying 0 siblings, 0 replies; 43+ messages in thread From: Huang, Ying @ 2024-06-19 9:21 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel, linux-mm, Barry Song Chris Li <chrisl@kernel.org> writes: > On Mon, Jun 17, 2024 at 11:56 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Chris Li <chrisl@kernel.org> writes: >> >> > That is in general true with all kernel development regardless of >> > using options or not. If there is a bug in my patch, I will need to >> > debug and fix it or the patch might be reverted. >> > >> > I don't see that as a reason to take the option path or not. The >> > option just means the user taking this option will need to understand >> > the trade off and accept the defined behavior of that option. >> >> User configuration knobs are not forbidden for Linux kernel. But we are >> more careful about them because they will introduce ABI which we need to >> maintain forever. And they are hard to be used for users. Optimizing >> automatically is generally the better solution. So, I suggest you to >> think more about the automatically solution before diving into a new >> option. > > I did, see my reply. Right now there are just no other options. > >> >> >> >> >> >> So, I prefer the transparent methods. Just like THP vs. hugetlbfs. >> >> > >> >> > Me too. I prefer transparent over reservation if it can achieve the >> >> > same goal. Do we have a fully transparent method spec out? How to >> >> > achieve fully transparent and also avoid fragmentation caused by mix >> >> > order allocation/free? >> >> > >> >> > Keep in mind that we are still in the early stage of the mTHP swap >> >> > development, I can have the reservation patch relatively easily. If >> >> > you come up with a better transparent method patch which can achieve >> >> > the same goal later, we can use it instead. >> >> >> >> Because we are still in the early stage, I think that we should try to >> >> improve transparent solution firstly. Personally, what I don't like is >> >> that we don't work on the transparent solution because we have the >> >> reservation solution. >> > >> > Do you have a road map or the design for the transparent solution you can share? >> > I am interested to know what is the short term step(e.g. a month) in >> > this transparent solution you have in mind, so we can compare the >> > different approaches. I can't reason much just by the name >> > "transparent solution" itself. Need more technical details. >> > >> > Right now we have a clear usage case we want to support, the swap >> > in/out mTHP with bigger zsmalloc buffers. We can start with the >> > limited usage case first then move to more general ones. >> >> TBH, This is what I don't like. It appears that you refuse to think >> about the transparent (or automatic) solution. > > Actually, that is not true, you make the wrong assumption about what I > have considered. I want to find out what you have in mind to compare > the near term solutions. Sorry about my wrong assumption. > In my recent LSF slide I already list 3 options to address this > fragmentation problem. > From easy to hard: > 1) Assign cluster an order on allocation and remember the cluster > order. (short term). > That is this patch series > 2) Buddy allocation on the swap entry (longer term) > 3) Folio write out compound discontinuous swap entry. 
(ultimate) > > I also considered 4), which I did not put into the slide, because it > is less effective than 3) > 4) migrating the swap entries, which require scan page table entry. > I briefly mentioned it during the session. Or you need something like a rmap, that isn't easy. > 3) should might qualify as your transparent solution. It is just much > harder to implement. > Even when we have 3), having some form of 1) can be beneficial as > well. (less IO count, no indirect layer of swap offset). > >> >> I haven't thought about them thoroughly, but at least we may think about >> >> - promoting low order non-full cluster when we find a free high order >> swap entries. >> >> - stealing a low order non-full cluster with low usage count for >> high-order allocation. > > Now we are talking. > These two above fall well within 2) the buddy allocators > But the buddy allocator will not be able to address all fragmentation > issues, due to the allocator not being controlled the life cycle of > the swap entry. > It will not help Barry's zsmalloc usage case much because android > likes to keep the swapfile full. I can already see that. I think that buddy-like allocator (not exactly buddy algorithm) will help fragmentation. And it will help more users because it works automatically. I don't think they are too hard to be implemented. We can try to find some simple solution firstly. So, I think that we don't need to push them to long term. At least, they can be done before introducing high-order cluster reservation ABI. Then, we can evaluate the benefit and overhead of reservation ABI. >> - freeing more swap entries when swap devices become fragmented. > > That requires a scan page table to free the swap entry, basically 4). No. You can just scan the page table of current process in do_swap_page() and try to swap-in and free more swap entries. That doesn't work well for the shared pages. However, I think that it can help quite some workloads. > It is all about investment and return. 1) is relatively easy to > implement and with good improvement and return. [snip] -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 43+ messages in thread
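The "stealing" idea floated above could take roughly the following shape, reusing the per-order non-full lists sketched earlier in the thread; the 1/8 usage threshold and the function name are invented for illustration.

        /* Illustration only: the threshold and all names are assumptions. */
        static struct swap_cluster_info *
        steal_low_order_cluster(struct swap_info_struct *si, int order)
        {
                struct swap_cluster_info *ci;
                int o;

                for (o = 0; o < order; o++) {
                        list_for_each_entry(ci, &si->nonfull_clusters[o], list) {
                                /* Only lightly used clusters are worth stealing. */
                                if (ci->count > SWAPFILE_CLUSTER / 8)
                                        continue;
                                /*
                                 * The remaining low-order entries still have to be
                                 * freed or relocated before the cluster can serve
                                 * the higher order; that is the hard part.
                                 */
                                return ci;
                        }
                }
                return NULL;
        }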
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-28 21:04 ` Chris Li 2024-05-29 8:55 ` Huang, Ying @ 2024-05-30 7:49 ` Barry Song 2024-06-07 10:49 ` Ryan Roberts 1 sibling, 1 reply; 43+ messages in thread From: Barry Song @ 2024-05-30 7:49 UTC (permalink / raw) To: Chris Li Cc: Andrew Morton, Kairui Song, Ryan Roberts, Huang, Ying, linux-kernel, linux-mm [-- Attachment #1: Type: text/plain, Size: 6650 bytes --] On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote: > > I am spinning a new version for this series to address two issues > found in this series: > > 1) Oppo discovered a bug in the following line: > + ci = si->cluster_info + tmp; > Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > That is a serious bug but trivial to fix. > > 2) order 0 allocation currently blindly scans swap_map disregarding > the cluster->order. Given enough order 0 swap allocations(close to the > swap file size) the order 0 allocation head will eventually sweep > across the whole swapfile and destroy other cluster order allocations. > > The short term fix is just skipping clusters that are already assigned > to higher orders. > > In the long term, I want to unify the non-SSD to use clusters for > locking and allocations as well, just try to follow the last > allocation (less seeking) as much as possible. Hi Chris, I am sharing some new test results with you. This time, we used two zRAM devices by modifying get_swap_pages(). zram0 -> dedicated for order-0 swpout zram1 -> dedicated for order-4 swpout We allocate a generous amount of space for zRAM1 to ensure it never gets full and always has ample free space. However, we found that Ryan's approach does not perform well even in this straightforward scenario. Despite zRAM1 having 80% of its space remaining, we still experience issues obtaining contiguous swap slots and encounter a high swpout_fallback ratio. Sorry for the report, Ryan :-) In contrast, with your patch, we consistently see the thp_swpout_fallback ratio at 0%, indicating a significant improvement in the situation. Although your patch still has issues supporting the mixing of order-0 and order-4 pages in a swap device, it represents a significant improvement. I would be delighted to witness your approach advancing with Ying Huang’s assistance. However, due to my current commitments, I regret that I am unable to allocate time for debugging. > > Chris > > > > On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote: > > > > This is the short term solutiolns "swap cluster order" listed > > in my "Swap Abstraction" discussion slice 8 in the recent > > LSF/MM conference. > > > > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > > orders" is introduced, it only allocates the mTHP swap entries > > from new empty cluster list. That works well for PMD size THP, > > but it has a serius fragmentation issue reported by Barry. > > > > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > > > > The mTHP allocation failure rate raises to almost 100% after a few > > hours in Barry's test run. > > > > The reason is that all the empty cluster has been exhausted while > > there are planty of free swap entries to in the cluster that is > > not 100% free. > > > > Address this by remember the swap allocation order in the cluster. > > Keep track of the per order non full cluster list for later allocation. > > > > This greatly improve the sucess rate of the mTHP swap allocation. 
> > While I am still waiting for Barry's test result. I paste Kairui's test > > result here: > > > > I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > > > > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > > swapoff -a > > mkswap /dev/ram0 > > swapon /dev/ram0 > > > > rmdir /sys/fs/cgroup/benchmark > > mkdir -p /sys/fs/cgroup/benchmark > > cd /sys/fs/cgroup/benchmark > > echo 8G > memory.max > > echo $$ > cgroup.procs > > > > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > > > > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > > -P memcache_binary -n allkeys --key-minimum=1 \ > > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > > --ratio 1:0 --pipeline 8 -d 1024 > > > > Before: > > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > > After: > > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > > > > And the fallback ratio dropped by a lot: > > Before: > > hugepages-32kB/stats/anon_swpout_fallback:15997 > > hugepages-32kB/stats/anon_swpout:18712 > > hugepages-512kB/stats/anon_swpout_fallback:192 > > hugepages-512kB/stats/anon_swpout:0 > > hugepages-2048kB/stats/anon_swpout_fallback:2 > > hugepages-2048kB/stats/anon_swpout:0 > > hugepages-1024kB/stats/anon_swpout_fallback:0 > > hugepages-1024kB/stats/anon_swpout:0 > > hugepages-64kB/stats/anon_swpout_fallback:18246 > > hugepages-64kB/stats/anon_swpout:17644 > > hugepages-16kB/stats/anon_swpout_fallback:13701 > > hugepages-16kB/stats/anon_swpout:18234 > > hugepages-256kB/stats/anon_swpout_fallback:8642 > > hugepages-256kB/stats/anon_swpout:93 > > hugepages-128kB/stats/anon_swpout_fallback:21497 > > hugepages-128kB/stats/anon_swpout:7596 > > > > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > > > > After: > > hugepages-32kB/stats/swpout:34445 > > hugepages-32kB/stats/swpout_fallback:0 > > hugepages-512kB/stats/swpout:1 > > hugepages-512kB/stats/swpout_fallback:134 > > hugepages-2048kB/stats/swpout:1 > > hugepages-2048kB/stats/swpout_fallback:1 > > hugepages-1024kB/stats/swpout:6 > > hugepages-1024kB/stats/swpout_fallback:0 > > hugepages-64kB/stats/swpout:35495 > > hugepages-64kB/stats/swpout_fallback:0 > > hugepages-16kB/stats/swpout:32441 > > hugepages-16kB/stats/swpout_fallback:0 > > hugepages-256kB/stats/swpout:2223 > > hugepages-256kB/stats/swpout_fallback:6278 > > hugepages-128kB/stats/swpout:29136 > > hugepages-128kB/stats/swpout_fallback:52 > > > > Reported-by: Barry Song <21cnbao@gmail.com> > > Tested-by: Kairui Song <kasong@tencent.com> > > Signed-off-by: Chris Li <chrisl@kernel.org> > > --- > > Chris Li (2): > > mm: swap: swap cluster switch to double link list > > mm: swap: mTHP allocate swap entries from nonfull list > > > > include/linux/swap.h | 18 ++-- > > mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > > 2 files changed, 93 insertions(+), 177 deletions(-) > > --- > > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > > change-id: 20240523-swap-allocator-1534c480ece4 > > > > Best regards, > > -- > > Chris Li <chrisl@kernel.org> > > Thanks Barry [-- Attachment #2: chris-swap-patch-dedicated-zram.png --] [-- Type: image/png, Size: 38784 bytes --] ^ permalink raw reply [flat|nested] 43+ messages in thread
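For anyone applying the series by hand, the one-line indexing fix Chris describes at the top of this message looks roughly like this; the surrounding context is paraphrased from his description rather than quoted from the patch.

        /*
         * tmp is an offset in swap entries, while cluster_info is indexed per
         * cluster, so the entry offset has to be scaled down first.
         */
        ci = si->cluster_info + tmp;                        /* buggy */
        ci = si->cluster_info + tmp / SWAPFILE_CLUSTER;     /* fixed */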
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-30 7:49 ` Barry Song @ 2024-06-07 10:49 ` Ryan Roberts 2024-06-07 18:57 ` Chris Li 0 siblings, 1 reply; 43+ messages in thread From: Ryan Roberts @ 2024-06-07 10:49 UTC (permalink / raw) To: Barry Song, Chris Li Cc: Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm On 30/05/2024 08:49, Barry Song wrote: > On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote: >> >> I am spinning a new version for this series to address two issues >> found in this series: >> >> 1) Oppo discovered a bug in the following line: >> + ci = si->cluster_info + tmp; >> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". >> That is a serious bug but trivial to fix. >> >> 2) order 0 allocation currently blindly scans swap_map disregarding >> the cluster->order. Given enough order 0 swap allocations(close to the >> swap file size) the order 0 allocation head will eventually sweep >> across the whole swapfile and destroy other cluster order allocations. >> >> The short term fix is just skipping clusters that are already assigned >> to higher orders. >> >> In the long term, I want to unify the non-SSD to use clusters for >> locking and allocations as well, just try to follow the last >> allocation (less seeking) as much as possible. > > Hi Chris, > > I am sharing some new test results with you. This time, we used two > zRAM devices by modifying get_swap_pages(). > > zram0 -> dedicated for order-0 swpout > zram1 -> dedicated for order-4 swpout > > We allocate a generous amount of space for zRAM1 to ensure it never gets full > and always has ample free space. However, we found that Ryan's approach > does not perform well even in this straightforward scenario. Despite zRAM1 > having 80% of its space remaining, we still experience issues obtaining > contiguous swap slots and encounter a high swpout_fallback ratio. > > Sorry for the report, Ryan :-) No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty sure that this is due to fragmentation preventing clusters from being freed back to the free list. > > In contrast, with your patch, we consistently see the thp_swpout_fallback ratio > at 0%, indicating a significant improvement in the situation. Unless I've misunderstood something critical, Chris's change is just allowing a cpu to steal a block from another cpu's current cluster for that order. So it just takes longer (approx by a factor of the number of cpus in the system) to get to the state where fragmentation is causing fallbacks? As I said in the other thread, I think the more robust solution is to implement scanning for high order blocks. > > Although your patch still has issues supporting the mixing of order-0 and > order-4 pages in a swap device, it represents a significant improvement. > > I would be delighted to witness your approach advancing with Ying > Huang’s assistance. However, due to my current commitments, I > regret that I am unable to allocate time for debugging. > >> >> Chris >> >> >> >> On Fri, May 24, 2024 at 10:17 AM Chris Li <chrisl@kernel.org> wrote: >>> >>> This is the short term solutiolns "swap cluster order" listed >>> in my "Swap Abstraction" discussion slice 8 in the recent >>> LSF/MM conference. >>> >>> When commit 845982eb264bc "mm: swap: allow storage of all mTHP >>> orders" is introduced, it only allocates the mTHP swap entries >>> from new empty cluster list. That works well for PMD size THP, >>> but it has a serius fragmentation issue reported by Barry. 
>>> >>> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ >>> >>> The mTHP allocation failure rate raises to almost 100% after a few >>> hours in Barry's test run. >>> >>> The reason is that all the empty cluster has been exhausted while >>> there are planty of free swap entries to in the cluster that is >>> not 100% free. >>> >>> Address this by remember the swap allocation order in the cluster. >>> Keep track of the per order non full cluster list for later allocation. >>> >>> This greatly improve the sucess rate of the mTHP swap allocation. >>> While I am still waiting for Barry's test result. I paste Kairui's test >>> result here: >>> >>> I'm able to reproduce such an issue with a simple script (enabling all order of mthp): >>> >>> modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) >>> swapoff -a >>> mkswap /dev/ram0 >>> swapon /dev/ram0 >>> >>> rmdir /sys/fs/cgroup/benchmark >>> mkdir -p /sys/fs/cgroup/benchmark >>> cd /sys/fs/cgroup/benchmark >>> echo 8G > memory.max >>> echo $$ > cgroup.procs >>> >>> memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & >>> >>> /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ >>> -P memcache_binary -n allkeys --key-minimum=1 \ >>> --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ >>> --ratio 1:0 --pipeline 8 -d 1024 >>> >>> Before: >>> Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 >>> After: >>> Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 >>> >>> And the fallback ratio dropped by a lot: >>> Before: >>> hugepages-32kB/stats/anon_swpout_fallback:15997 >>> hugepages-32kB/stats/anon_swpout:18712 >>> hugepages-512kB/stats/anon_swpout_fallback:192 >>> hugepages-512kB/stats/anon_swpout:0 >>> hugepages-2048kB/stats/anon_swpout_fallback:2 >>> hugepages-2048kB/stats/anon_swpout:0 >>> hugepages-1024kB/stats/anon_swpout_fallback:0 >>> hugepages-1024kB/stats/anon_swpout:0 >>> hugepages-64kB/stats/anon_swpout_fallback:18246 >>> hugepages-64kB/stats/anon_swpout:17644 >>> hugepages-16kB/stats/anon_swpout_fallback:13701 >>> hugepages-16kB/stats/anon_swpout:18234 >>> hugepages-256kB/stats/anon_swpout_fallback:8642 >>> hugepages-256kB/stats/anon_swpout:93 >>> hugepages-128kB/stats/anon_swpout_fallback:21497 >>> hugepages-128kB/stats/anon_swpout:7596 >>> >>> (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) >>> >>> After: >>> hugepages-32kB/stats/swpout:34445 >>> hugepages-32kB/stats/swpout_fallback:0 >>> hugepages-512kB/stats/swpout:1 >>> hugepages-512kB/stats/swpout_fallback:134 >>> hugepages-2048kB/stats/swpout:1 >>> hugepages-2048kB/stats/swpout_fallback:1 >>> hugepages-1024kB/stats/swpout:6 >>> hugepages-1024kB/stats/swpout_fallback:0 >>> hugepages-64kB/stats/swpout:35495 >>> hugepages-64kB/stats/swpout_fallback:0 >>> hugepages-16kB/stats/swpout:32441 >>> hugepages-16kB/stats/swpout_fallback:0 >>> hugepages-256kB/stats/swpout:2223 >>> hugepages-256kB/stats/swpout_fallback:6278 >>> hugepages-128kB/stats/swpout:29136 >>> hugepages-128kB/stats/swpout_fallback:52 >>> >>> Reported-by: Barry Song <21cnbao@gmail.com> >>> Tested-by: Kairui Song <kasong@tencent.com> >>> Signed-off-by: Chris Li <chrisl@kernel.org> >>> --- >>> Chris Li (2): >>> mm: swap: swap cluster switch to double link list >>> mm: swap: mTHP allocate swap entries from nonfull list >>> >>> include/linux/swap.h | 18 ++-- >>> mm/swapfile.c | 252 
+++++++++++++++++---------------------------------- >>> 2 files changed, 93 insertions(+), 177 deletions(-) >>> --- >>> base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed >>> change-id: 20240523-swap-allocator-1534c480ece4 >>> >>> Best regards, >>> -- >>> Chris Li <chrisl@kernel.org> >>> > > Thanks > Barry ^ permalink raw reply [flat|nested] 43+ messages in thread
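For reference, the "scanning for high order blocks" Ryan suggests earlier in this message could look something like the sketch below (Chris argues later in the thread that scanning alone does not stop fragmentation). Locking and cluster bookkeeping are omitted, and the function name is invented.

        static unsigned long scan_swap_map_for_order(struct swap_info_struct *si,
                                                     int order)
        {
                unsigned long nr = 1UL << order;
                unsigned long off, i;

                /* Walk aligned windows looking for nr contiguous free slots. */
                for (off = round_up(si->lowest_bit, nr);
                     off + nr - 1 <= si->highest_bit; off += nr) {
                        for (i = 0; i < nr; i++)
                                if (si->swap_map[off + i])
                                        break;
                        if (i == nr)
                                return off;     /* aligned free run found */
                }
                return 0;       /* caller falls back to a lower order */
        }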
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-07 10:49 ` Ryan Roberts @ 2024-06-07 18:57 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-06-07 18:57 UTC (permalink / raw) To: Ryan Roberts Cc: Barry Song, Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm On Fri, Jun 7, 2024 at 3:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 30/05/2024 08:49, Barry Song wrote: > > On Wed, May 29, 2024 at 9:04 AM Chris Li <chrisl@kernel.org> wrote: > >> > >> I am spinning a new version for this series to address two issues > >> found in this series: > >> > >> 1) Oppo discovered a bug in the following line: > >> + ci = si->cluster_info + tmp; > >> Should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". > >> That is a serious bug but trivial to fix. > >> > >> 2) order 0 allocation currently blindly scans swap_map disregarding > >> the cluster->order. Given enough order 0 swap allocations(close to the > >> swap file size) the order 0 allocation head will eventually sweep > >> across the whole swapfile and destroy other cluster order allocations. > >> > >> The short term fix is just skipping clusters that are already assigned > >> to higher orders. > >> > >> In the long term, I want to unify the non-SSD to use clusters for > >> locking and allocations as well, just try to follow the last > >> allocation (less seeking) as much as possible. > > > > Hi Chris, > > > > I am sharing some new test results with you. This time, we used two > > zRAM devices by modifying get_swap_pages(). > > > > zram0 -> dedicated for order-0 swpout > > zram1 -> dedicated for order-4 swpout > > > > We allocate a generous amount of space for zRAM1 to ensure it never gets full > > and always has ample free space. However, we found that Ryan's approach > > does not perform well even in this straightforward scenario. Despite zRAM1 > > having 80% of its space remaining, we still experience issues obtaining > > contiguous swap slots and encounter a high swpout_fallback ratio. > > > > Sorry for the report, Ryan :-) > > No problem; clearly it needs to be fixed, and I'll help where I can. I'm pretty > sure that this is due to fragmentation preventing clusters from being freed back > to the free list. > > > > > In contrast, with your patch, we consistently see the thp_swpout_fallback ratio > > at 0%, indicating a significant improvement in the situation. > > Unless I've misunderstood something critical, Chris's change is just allowing a > cpu to steal a block from another cpu's current cluster for that order. So it No, that is not the main change. The main change is to allow the CPU to allocate from the nonfull and non-empty cluster, which are not in any CPU's current cluster, not in the empty list either. The current patch does not prevent the CPU from stealing from the other CPU current order. It will get addressed in V2. > just takes longer (approx by a factor of the number of cpus in the system) to > get to the state where fragmentation is causing fallbacks? As I said in the > other thread, I think the more robust solution is to implement scanning for high > order blocks. That will introduce more fragmentation to the high order cluster, and will make it harder to allocate high order swap entry later. Please see my previous email for the usage case and the goal of the change. 
https://lore.kernel.org/linux-mm/CANeU7QnVzqGKXp9pKDLWiuhqTvBxXupgFCRXejYhshAjw6uDyQ@mail.gmail.com/T/#mf431a743e458896c2ab4a4077b103341313c9cf4 Let's discuss whether the usage case and the goal make sense or not. Chris ^ permalink raw reply [flat|nested] 43+ messages in thread
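Putting Chris's description into code form, the allocation preference being described is roughly the following; the per-CPU helper and the list layout are placeholders, not the actual patch.

        static struct swap_cluster_info *
        pick_cluster_for_order(struct swap_info_struct *si, int order)
        {
                struct swap_cluster_info *ci;

                /* 1) This CPU's current cluster for the order, if it has room. */
                ci = current_cpu_cluster(si, order);    /* hypothetical helper */
                if (ci && ci->count < SWAPFILE_CLUSTER)
                        return ci;

                /*
                 * 2) A non-full, non-empty cluster already assigned this order,
                 *    even though no CPU currently points at it.  This is the
                 *    part the series adds.
                 */
                ci = list_first_entry_or_null(&si->nonfull_clusters[order],
                                              struct swap_cluster_info, list);
                if (ci)
                        return ci;

                /* 3) A fresh cluster from the empty list, which then takes the order. */
                return list_first_entry_or_null(&si->free_clusters,
                                                struct swap_cluster_info, list);
        }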
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-05-24 17:17 [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li ` (3 preceding siblings ...) 2024-05-28 21:04 ` Chris Li @ 2024-06-07 9:43 ` Ryan Roberts 2024-06-07 18:48 ` Chris Li 4 siblings, 1 reply; 43+ messages in thread From: Ryan Roberts @ 2024-06-07 9:43 UTC (permalink / raw) To: Chris Li, Andrew Morton Cc: Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just getting through my mail now... On 24/05/2024 18:17, Chris Li wrote: > This is the short term solutiolns "swap cluster order" listed > in my "Swap Abstraction" discussion slice 8 in the recent > LSF/MM conference. I've read the article on lwn and look forward to watching the video once available. The longer term plans look interesting. > > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > orders" is introduced, it only allocates the mTHP swap entries > from new empty cluster list. That works well for PMD size THP, > but it has a serius fragmentation issue reported by Barry. Yes, that was a deliberate initial approach to be conservative, just like the original PMD-size THP support. I'm glad to see work to improve the situation! > > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > > The mTHP allocation failure rate raises to almost 100% after a few > hours in Barry's test run. > > The reason is that all the empty cluster has been exhausted while > there are planty of free swap entries to in the cluster that is > not 100% free. > > Address this by remember the swap allocation order in the cluster. > Keep track of the per order non full cluster list for later allocation. I don't immediately see how this helps because memory is swapped back in per-page (currently), so just because a given cluster was initially filled with entries of a given order, doesn't mean that those entries are freed in atomic units; only specific pages could have been swapped back in, meaning the holes are not of the required order. Additionally, scanning could lead to order-0 pages being populated in random places. My naive assumption was that the obvious way to solve this problem in the short term would be to extend the scanning logic to be able to scan for an arbitrary order. That way you could find an allocation of the required order in any of the clusters, even a cluster that was not originally allocated for the required order. I guess I should read your patches to understand exactly what you are doing rather than making assumptions... Thanks, Ryan > > This greatly improve the sucess rate of the mTHP swap allocation. > While I am still waiting for Barry's test result. 
I paste Kairui's test > result here: > > I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > swapoff -a > mkswap /dev/ram0 > swapon /dev/ram0 > > rmdir /sys/fs/cgroup/benchmark > mkdir -p /sys/fs/cgroup/benchmark > cd /sys/fs/cgroup/benchmark > echo 8G > memory.max > echo $$ > cgroup.procs > > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > -P memcache_binary -n allkeys --key-minimum=1 \ > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > --ratio 1:0 --pipeline 8 -d 1024 > > Before: > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > After: > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > > And the fallback ratio dropped by a lot: > Before: > hugepages-32kB/stats/anon_swpout_fallback:15997 > hugepages-32kB/stats/anon_swpout:18712 > hugepages-512kB/stats/anon_swpout_fallback:192 > hugepages-512kB/stats/anon_swpout:0 > hugepages-2048kB/stats/anon_swpout_fallback:2 > hugepages-2048kB/stats/anon_swpout:0 > hugepages-1024kB/stats/anon_swpout_fallback:0 > hugepages-1024kB/stats/anon_swpout:0 > hugepages-64kB/stats/anon_swpout_fallback:18246 > hugepages-64kB/stats/anon_swpout:17644 > hugepages-16kB/stats/anon_swpout_fallback:13701 > hugepages-16kB/stats/anon_swpout:18234 > hugepages-256kB/stats/anon_swpout_fallback:8642 > hugepages-256kB/stats/anon_swpout:93 > hugepages-128kB/stats/anon_swpout_fallback:21497 > hugepages-128kB/stats/anon_swpout:7596 > > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > > After: > hugepages-32kB/stats/swpout:34445 > hugepages-32kB/stats/swpout_fallback:0 > hugepages-512kB/stats/swpout:1 > hugepages-512kB/stats/swpout_fallback:134 > hugepages-2048kB/stats/swpout:1 > hugepages-2048kB/stats/swpout_fallback:1 > hugepages-1024kB/stats/swpout:6 > hugepages-1024kB/stats/swpout_fallback:0 > hugepages-64kB/stats/swpout:35495 > hugepages-64kB/stats/swpout_fallback:0 > hugepages-16kB/stats/swpout:32441 > hugepages-16kB/stats/swpout_fallback:0 > hugepages-256kB/stats/swpout:2223 > hugepages-256kB/stats/swpout_fallback:6278 > hugepages-128kB/stats/swpout:29136 > hugepages-128kB/stats/swpout_fallback:52 > > Reported-by: Barry Song <21cnbao@gmail.com> > Tested-by: Kairui Song <kasong@tencent.com> > Signed-off-by: Chris Li <chrisl@kernel.org> > --- > Chris Li (2): > mm: swap: swap cluster switch to double link list > mm: swap: mTHP allocate swap entries from nonfull list > > include/linux/swap.h | 18 ++-- > mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > 2 files changed, 93 insertions(+), 177 deletions(-) > --- > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > change-id: 20240523-swap-allocator-1534c480ece4 > > Best regards, ^ permalink raw reply [flat|nested] 43+ messages in thread
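A tiny worked example of the fragmentation concern Ryan raises in this message, using a hypothetical 16-slot cluster that was filled with order-2 (4-slot) entries:

        initially:                        [AAAA][BBBB][CCCC][DDDD]
        after one page each of B and C
        is swapped back in (order 0):     [AAAA][B.BB][C.CC][DDDD]

The cluster now has two free slots but no aligned run of four, so it cannot serve another order-2 allocation even though it is not full; and if the order-0 scanner is allowed into it, it will fill those holes and fragment the cluster further.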
* Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order 2024-06-07 9:43 ` Ryan Roberts @ 2024-06-07 18:48 ` Chris Li 0 siblings, 0 replies; 43+ messages in thread From: Chris Li @ 2024-06-07 18:48 UTC (permalink / raw) To: Ryan Roberts Cc: Andrew Morton, Kairui Song, Huang, Ying, linux-kernel, linux-mm, Barry Song On Fri, Jun 7, 2024 at 2:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Sorry I'm late to the discussion - I've been out for the last 3.5 weeks and just > getting through my mail now... No problem at all, please take it easy. > > > On 24/05/2024 18:17, Chris Li wrote: > > This is the short term solutiolns "swap cluster order" listed > > in my "Swap Abstraction" discussion slice 8 in the recent > > LSF/MM conference. > > I've read the article on lwn and look forward to watching the video once > available. The longer term plans look interesting. > > > > > When commit 845982eb264bc "mm: swap: allow storage of all mTHP > > orders" is introduced, it only allocates the mTHP swap entries > > from new empty cluster list. That works well for PMD size THP, > > but it has a serius fragmentation issue reported by Barry. > > Yes, that was a deliberate initial approach to be conservative, just like the > original PMD-size THP support. I'm glad to see work to improve the situation! > > > > > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/ > > > > The mTHP allocation failure rate raises to almost 100% after a few > > hours in Barry's test run. > > > > The reason is that all the empty cluster has been exhausted while > > there are planty of free swap entries to in the cluster that is > > not 100% free. > > > > Address this by remember the swap allocation order in the cluster. > > Keep track of the per order non full cluster list for later allocation. > > I don't immediately see how this helps because memory is swapped back in > per-page (currently), so just because a given cluster was initially filled with That is not the case for Barry's setup, he has some other patch series to swap in mTHP as a whole. Especially in for the mTHP store in the zsmalloc as bigger than 4K pages. https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/ > entries of a given order, doesn't mean that those entries are freed in atomic > units; only specific pages could have been swapped back in, meaning the holes > are not of the required order. Additionally, scanning could lead to order-0 > pages being populated in random places. Yes, that is a problem we need to address. The proposed short term solution is to have an isolation scheme preventing the high order swap entry mix with the lower order one inside one cluster. That is easy to do and has some test results confirming the reservation/isolation effect. > > My naive assumption was that the obvious way to solve this problem in the short > term would be to extend the scanning logic to be able to scan for an arbitrary > order. That way you could find an allocation of the required order in any of the > clusters, even a cluster that was not originally allocated for the required order. > > I guess I should read your patches to understand exactly what you are doing > rather than making assumptions... Scanning is not enough. We need to have some way to prevent the fragmentation from happening. Once the fragmentation has happened, it can't be easily reversed. Scanning does not help the fragmentation aspect. 
Chris > > Thanks, > Ryan > > > > > This greatly improve the sucess rate of the mTHP swap allocation. > > While I am still waiting for Barry's test result. I paste Kairui's test > > result here: > > > > I'm able to reproduce such an issue with a simple script (enabling all order of mthp): > > > > modprobe brd rd_nr=1 rd_size=$(( 10 * 1024 * 1024)) > > swapoff -a > > mkswap /dev/ram0 > > swapon /dev/ram0 > > > > rmdir /sys/fs/cgroup/benchmark > > mkdir -p /sys/fs/cgroup/benchmark > > cd /sys/fs/cgroup/benchmark > > echo 8G > memory.max > > echo $$ > cgroup.procs > > > > memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 -t 32 -B binary & > > > > /usr/local/bin/memtier_benchmark -S /tmp/memcached.socket \ > > -P memcache_binary -n allkeys --key-minimum=1 \ > > --key-maximum=18000000 --key-pattern=P:P -c 1 -t 32 \ > > --ratio 1:0 --pipeline 8 -d 1024 > > > > Before: > > Totals 48805.63 0.00 0.00 5.26045 1.19100 38.91100 59.64700 51063.98 > > After: > > Totals 71098.84 0.00 0.00 3.60585 0.71100 26.36700 39.16700 74388.74 > > > > And the fallback ratio dropped by a lot: > > Before: > > hugepages-32kB/stats/anon_swpout_fallback:15997 > > hugepages-32kB/stats/anon_swpout:18712 > > hugepages-512kB/stats/anon_swpout_fallback:192 > > hugepages-512kB/stats/anon_swpout:0 > > hugepages-2048kB/stats/anon_swpout_fallback:2 > > hugepages-2048kB/stats/anon_swpout:0 > > hugepages-1024kB/stats/anon_swpout_fallback:0 > > hugepages-1024kB/stats/anon_swpout:0 > > hugepages-64kB/stats/anon_swpout_fallback:18246 > > hugepages-64kB/stats/anon_swpout:17644 > > hugepages-16kB/stats/anon_swpout_fallback:13701 > > hugepages-16kB/stats/anon_swpout:18234 > > hugepages-256kB/stats/anon_swpout_fallback:8642 > > hugepages-256kB/stats/anon_swpout:93 > > hugepages-128kB/stats/anon_swpout_fallback:21497 > > hugepages-128kB/stats/anon_swpout:7596 > > > > (Still collecting more data, the success swpout was mostly done early, then the fallback began to increase, nearly 100% failure rate) > > > > After: > > hugepages-32kB/stats/swpout:34445 > > hugepages-32kB/stats/swpout_fallback:0 > > hugepages-512kB/stats/swpout:1 > > hugepages-512kB/stats/swpout_fallback:134 > > hugepages-2048kB/stats/swpout:1 > > hugepages-2048kB/stats/swpout_fallback:1 > > hugepages-1024kB/stats/swpout:6 > > hugepages-1024kB/stats/swpout_fallback:0 > > hugepages-64kB/stats/swpout:35495 > > hugepages-64kB/stats/swpout_fallback:0 > > hugepages-16kB/stats/swpout:32441 > > hugepages-16kB/stats/swpout_fallback:0 > > hugepages-256kB/stats/swpout:2223 > > hugepages-256kB/stats/swpout_fallback:6278 > > hugepages-128kB/stats/swpout:29136 > > hugepages-128kB/stats/swpout_fallback:52 > > > > Reported-by: Barry Song <21cnbao@gmail.com> > > Tested-by: Kairui Song <kasong@tencent.com> > > Signed-off-by: Chris Li <chrisl@kernel.org> > > --- > > Chris Li (2): > > mm: swap: swap cluster switch to double link list > > mm: swap: mTHP allocate swap entries from nonfull list > > > > include/linux/swap.h | 18 ++-- > > mm/swapfile.c | 252 +++++++++++++++++---------------------------------- > > 2 files changed, 93 insertions(+), 177 deletions(-) > > --- > > base-commit: c65920c76a977c2b73c3a8b03b4c0c00cc1285ed > > change-id: 20240523-swap-allocator-1534c480ece4 > > > > Best regards, > ^ permalink raw reply [flat|nested] 43+ messages in thread
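The isolation Chris describes, keeping the order-0 scanner out of clusters that have been assigned a higher order, can be pictured roughly as below; the helper name and the order field are illustrative, not quoted from the upcoming v2.

        static unsigned long skip_high_order_clusters(struct swap_info_struct *si,
                                                      unsigned long offset)
        {
                struct swap_cluster_info *ci;

                while (offset <= si->highest_bit) {
                        ci = si->cluster_info + offset / SWAPFILE_CLUSTER;
                        if (ci->order == 0)
                                return offset;  /* fine for the order-0 scan */
                        /* Jump over a cluster reserved for a higher order. */
                        offset = round_up(offset + 1, SWAPFILE_CLUSTER);
                }
                return 0;
        }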
end of thread, other threads:[~2024-06-19 9:23 UTC | newest]

Thread overview: 43+ messages
2024-05-24 17:17 [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-05-24 17:17 ` [PATCH 1/2] mm: swap: swap cluster switch to double link list Chris Li
2024-05-28 16:23 ` Kairui Song
2024-05-28 22:27 ` Chris Li
2024-05-29 0:50 ` Chris Li
2024-05-29 8:46 ` Huang, Ying
2024-05-30 21:49 ` Chris Li
2024-05-31 2:03 ` Huang, Ying
2024-05-24 17:17 ` [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
2024-06-07 10:35 ` Ryan Roberts
2024-06-07 10:57 ` Ryan Roberts
2024-06-07 20:53 ` Chris Li
2024-06-07 20:52 ` Chris Li
2024-06-10 11:18 ` Ryan Roberts
2024-06-11 6:09 ` Chris Li
2024-05-28 3:07 ` [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order Barry Song
2024-05-28 21:04 ` Chris Li
2024-05-29 8:55 ` Huang, Ying
2024-05-30 1:13 ` Chris Li
2024-05-30 2:52 ` Huang, Ying
2024-05-30 8:08 ` Kairui Song
2024-05-30 18:31 ` Chris Li
2024-05-30 21:44 ` Chris Li
2024-05-31 2:35 ` Huang, Ying
2024-05-31 12:40 ` Kairui Song
2024-06-04 7:27 ` Huang, Ying
2024-06-05 7:40 ` Chris Li
2024-06-05 7:30 ` Chris Li
2024-06-05 7:08 ` Chris Li
2024-06-06 1:55 ` Huang, Ying
2024-06-07 18:40 ` Chris Li
2024-06-11 2:36 ` Huang, Ying
2024-06-11 7:11 ` Chris Li
2024-06-13 8:38 ` Huang, Ying
2024-06-18 4:35 ` Chris Li
2024-06-18 6:54 ` Huang, Ying
2024-06-18 9:31 ` Chris Li
2024-06-19 9:21 ` Huang, Ying
2024-05-30 7:49 ` Barry Song
2024-06-07 10:49 ` Ryan Roberts
2024-06-07 18:57 ` Chris Li
2024-06-07 9:43 ` Ryan Roberts
2024-06-07 18:48 ` Chris Li