Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list


From: Ryan Roberts <ryan.roberts@arm.com>
To: Chris Li <chrisl@kernel.org>, Andrew Morton <akpm@linux-foundation.org>
Cc: Kairui Song <kasong@tencent.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Barry Song <baohua@kernel.org>
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list
Date: Fri, 7 Jun 2024 11:57:52 +0100
Message-ID: <7553070e-630e-4e86-b64e-66cfce1ee125@arm.com>
In-Reply-To: <edb439ea-4754-4d63-8d5f-edc116465d7b@arm.com>

On 07/06/2024 11:35, Ryan Roberts wrote:
> On 24/05/2024 18:17, Chris Li wrote:
>> Track the nonfull cluster as well as the empty cluster
>> on lists. Each order has one nonfull cluster list.
>>
>> The cluster remembers which order it was allocated for when
>> it is taken from the free list.
>>
>> When a cluster has a free entry, add it to the nonfull[order]
>> list.  When the free cluster list is empty, also allocate
>> from the nonfull list of that order.
>>
>> This improves the mTHP swap allocation success rate.
> 
> If I've understood correctly, the aim here is to link all the current per-cpu
> clusters for a given order together so that if a cpu can't allocate a new
> cluster for a given order, then it can steal another CPU's current cluster for
> that order?
> 
> If that's the intent, couldn't that be done just by iterating over the per-cpu,
> per-order cluster pointers? Then you don't need all the linked list churn
> (although I like the linked list changes as a nice cleanup, I'm not sure the
> churn is necessary for this change?). There would likely need to be some
> locking considerations, but it would also allow you to get access to the next
> entry within the cluster for allocation.
> 
> However, fundamentally, I don't think this change solves the problem; it just
> takes a bit longer before the allocation fails. The real problem is
> fragmentation due to freeing individual pages from swap entries at different times.
> 
> Wouldn't it be better to just extend scanning to support high order allocations?
> Then we can steal a high order block from any cluster, even clusters that were
> previously full, just like we currently do for order-0. Given we are already
> falling back to this path for order-0, I don't think it would be any more
> expensive; in fact it's less expensive because we only scan once for the high
> order block, rather than scan for every split order-0 page.
> 
> Of course that still doesn't solve the problem entirely; if swap is so
> fragmented that there is no contiguous block of the required order then you
> still have to fall back to splitting. As an extra optimization, you could store
> the largest contiguous free space available in each cluster to avoid scanning in
> case it's too small?
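
To make that last suggestion concrete, here is a rough userspace sketch (not
kernel code) of caching a "largest free run" hint per cluster and using it to
skip clusters that cannot possibly satisfy a given order. Everything named
toy_* is invented for illustration, including the max_free_run field; nothing
like it exists in struct swap_cluster_info today. Note the hint is only a
necessary condition: a run of 2^order free slots may still not be naturally
aligned, so a real scan is needed when the check passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cluster state; not the kernel's struct swap_cluster_info. */
struct toy_cluster {
	const unsigned char *map;	/* usage count per slot, 0 means free */
	size_t nr_slots;		/* number of slots in this cluster */
	size_t max_free_run;		/* assumed cached hint: longest free run */
};

/* Recompute the longest run of consecutive free slots in one pass. */
static inline size_t toy_longest_free_run(const unsigned char *map,
					  size_t nr_slots)
{
	size_t best = 0, run = 0;

	for (size_t i = 0; i < nr_slots; i++) {
		run = map[i] ? 0 : run + 1;
		if (run > best)
			best = run;
	}
	return best;
}

/*
 * Cheap pre-check before scanning: if the longest free run is shorter
 * than 2^order slots, no block of that order can fit in this cluster.
 */
static inline bool toy_worth_scanning(const struct toy_cluster *c,
				      unsigned int order)
{
	assert(order < 8 * sizeof(size_t));
	return c->max_free_run >= ((size_t)1 << order);
}
```

The open question with this kind of hint is the cost of keeping it up to date
on every alloc and free; the sketch simply recomputes it with a full pass.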
> 
> 
>>
>> There are limitations if the distribution of mTHP orders
>> changes a lot, e.g. many nonfull clusters are assigned to
>> order A, and later there are many order B allocations but
>> very few order A allocations. Currently a cluster used by
>> order A will not be reused by order B unless the cluster is
>> 100% empty.
>>
>> This situation is best addressed by the longer term "swap
>> buddy allocator", in future patches.
>> ---
>>  include/linux/swap.h |  4 ++++
>>  mm/swapfile.c        | 25 +++++++++++++++++++++++--
>>  2 files changed, 27 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0d3906eff3c9..1b7f0794b9bf 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -255,10 +255,12 @@ struct swap_cluster_info {
>>  				 * cluster
>>  				 */
>>  	unsigned int count:16;
>> +	unsigned int order:8;
>>  	unsigned int flags:8;
>>  	struct list_head next;
>>  };
>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>  
>>  
>>  /*
>> @@ -297,6 +299,8 @@ struct swap_info_struct {
>>  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
>>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>  	struct list_head free_clusters; /* free clusters list */
>> +	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>> +					/* list of cluster that contains at least one free slot */
>>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>>  	unsigned int highest_bit;	/* index of last free in swap_map */
>>  	unsigned int pages;		/* total of usable pages of swap */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 205a60c5f9cb..51923aba500e 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>  
>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>  {
>> +	if (ci->flags & CLUSTER_FLAG_NONFULL)
>> +		list_move_tail(&ci->next, &si->free_clusters);
>> +	else
>> +		list_add_tail(&ci->next, &si->free_clusters);
>>  	ci->flags = CLUSTER_FLAG_FREE;
>> -	list_add_tail(&ci->next, &si->free_clusters);
>>  }
>>  
>>  /*
>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>  	ci->count--;
>>  
>>  	if (!ci->count)
>> -		free_cluster(p, ci);
>> +		return free_cluster(p, ci);
>> +
>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> +		list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
>> +		ci->flags |= CLUSTER_FLAG_NONFULL;
>> +	}
>>  }
>>  
>>  /*
>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>>  			list_del(&ci->next);
>>  			spin_lock(&ci->lock);
>> +			ci->order = order;
>> +			ci->flags = 0;
>> +			spin_unlock(&ci->lock);
>> +			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>> +		} else if (!list_empty(&si->nonfull_clusters[order])) {
>> +			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
>> +			list_del(&ci->next);
>> +			spin_lock(&ci->lock);
>>  			ci->flags = 0;
>>  			spin_unlock(&ci->lock);
>>  			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
> 
> This looks wrong to me; if the cluster is on the nonfull list then it will have
> had some entries already allocated (by another cpu). So pointing tmp to the
> first block in the cluster will never yield a free block. The cpu from which you
> are stealing the cluster stores the next free block location in its per-cpu
> structure. So perhaps iterating over the other cpu's `struct percpu_cluster`s is
> a better approach than the nonfull list?

Ahh; of course the cluster scan below will move this along to a free block.
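
For anyone following along, a simplified userspace model of that scan, as I
read it: start at the first slot of the cluster and step by nr_pages until a
fully free block turns up. This is loosely modelled on the `tmp += nr_pages`
loop visible in the hunk further down, not a copy of it; the toy_* names and
TOY_CLUSTER_SLOTS are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cluster size for the model; the kernel uses SWAPFILE_CLUSTER. */
#define TOY_CLUSTER_SLOTS 16

/* Return true if slots [start, start + nr) all have a zero usage count. */
static inline bool toy_range_free(const unsigned char *map, size_t start,
				  size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (map[start + i])
			return false;
	return true;
}

/*
 * Begin at the cluster's first slot and advance in nr_pages strides until
 * a fully free, naturally aligned block is found. Returns the slot offset
 * of that block, or TOY_CLUSTER_SLOTS if the cluster has none.
 */
static inline size_t toy_scan_cluster(const unsigned char *map,
				      size_t nr_pages)
{
	size_t tmp = 0;

	assert(nr_pages > 0 && TOY_CLUSTER_SLOTS % nr_pages == 0);
	while (tmp < TOY_CLUSTER_SLOTS) {
		if (toy_range_free(map, tmp, nr_pages))
			break;
		tmp += nr_pages;
	}
	return tmp;
}
```

So starting tmp at the first block of a partially used cluster costs a few
extra iterations but still lands on a free block if one exists.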

> 
> Additionally, this cluster will be stored back to this cpu's current cluster at
> the bottom of the function. That may or may not be what you intended.
> 
>> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  				break;
>>  			tmp += nr_pages;
>>  		}
>> +		WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
>>  		unlock_cluster(ci);
>>  	}
>>  	if (tmp >= max) {
>> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  	ci = lock_cluster(si, offset);
>>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>  	ci->count = 0;
>> +	ci->order = 0;
>>  	ci->flags = 0;
>>  	free_cluster(si, ci);
>>  	unlock_cluster(ci);
>> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>>  	INIT_LIST_HEAD(&p->free_clusters);
>>  	INIT_LIST_HEAD(&p->discard_clusters);
>>  
>> +	for (i = 0; i < SWAP_NR_ORDERS; i++)
>> +		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
>> +
>>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>>  		unsigned int page_nr = swap_header->info.badpages[i];
>>  		if (page_nr == 0 || page_nr > swap_header->info.last_page)
>>
>


