* [PATCH 0/2] mm: store zero pages to be swapped out in a bitmap
@ 2024-05-30 10:19 Usama Arif
2024-05-30 10:19 ` [PATCH 1/2] " Usama Arif
2024-05-30 10:19 ` [PATCH 2/2] mm: remove code to handle same filled pages Usama Arif
0 siblings, 2 replies; 12+ messages in thread
From: Usama Arif @ 2024-05-30 10:19 UTC (permalink / raw)
To: akpm
Cc: hannes, yosryahmed, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Usama Arif
As shown in the patch series that introduced the zswap same-filled
optimization [1], 10-20% of the pages stored in zswap are same-filled.
This is also observed across Meta's server fleet. Using VM counters in
swap_writepage (not included in this patch series), it was found that
less than 1% of the same-filled pages to be swapped out are non-zero.

For a conventional swap setup (without zswap), rather than
reading/writing these pages to flash, which increases I/O and flash
wear, a bitmap can be used to mark these pages as zero at write time,
and the pages can be filled with zeros at read time if the bit
corresponding to the page is set. When zswap is used on top of swap,
this also means that a zswap_entry does not need to be allocated for
zero-filled pages, a memory saving that offsets the memory used for
the bitmap.

A similar attempt was made earlier in [2], where zswap would track
only zero-filled pages instead of same-filled ones.

This patch series adds the zero-filled page optimization to swap
(so it can be used even if zswap is disabled) and removes the
same-filled code from zswap (as less than 1% of the same-filled pages
are non-zero), simplifying the code.

This patch series is based on mm-unstable.
[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/20240325235018.2028408-1-yosryahmed@google.com/
Usama Arif (2):
mm: store zero pages to be swapped out in a bitmap
mm: remove code to handle same filled pages
include/linux/swap.h | 1 +
mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
mm/swapfile.c | 10 ++++++
mm/zswap.c | 79 ++++------------------------------------
4 files changed, 102 insertions(+), 74 deletions(-)
--
2.43.0
* [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 10:19 [PATCH 0/2] mm: store zero pages to be swapped out in a bitmap Usama Arif
@ 2024-05-30 10:19 ` Usama Arif
2024-05-30 12:27 ` Johannes Weiner
` (2 more replies)
2024-05-30 10:19 ` [PATCH 2/2] mm: remove code to handle same filled pages Usama Arif
1 sibling, 3 replies; 12+ messages in thread
From: Usama Arif @ 2024-05-30 10:19 UTC (permalink / raw)
To: akpm
Cc: hannes, yosryahmed, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Usama Arif
Approximately 10-20% of pages to be swapped out are zero pages [1].
Rather than reading/writing these pages to flash resulting
in increased I/O and flash wear, a bitmap can be used to mark these
pages as zero at write time, and the pages can be filled at
read time if the bit corresponding to the page is set.
With this patch, NVMe writes in Meta server fleet decreased
by almost 10% with conventional swap setup (zswap disabled).
[1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
include/linux/swap.h | 1 +
mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
mm/swapfile.c | 10 ++++++
3 files changed, 95 insertions(+), 2 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a11c75e897ec..e88563978441 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -299,6 +299,7 @@ struct swap_info_struct {
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
+ unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct swap_cluster_list free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map */
diff --git a/mm/page_io.c b/mm/page_io.c
index a360857cf75d..ab043b4ad577 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -172,6 +172,77 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
goto out;
}
+static bool is_folio_page_zero_filled(struct folio *folio, int i)
+{
+ unsigned long *page;
+ unsigned int pos;
+ bool ret = false;
+
+ page = kmap_local_folio(folio, i * PAGE_SIZE);
+ for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) {
+ if (page[pos] != 0)
+ goto out;
+ }
+ ret = true;
+out:
+ kunmap_local(page);
+ return ret;
+}
+
+static bool is_folio_zero_filled(struct folio *folio)
+{
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ if (!is_folio_page_zero_filled(folio, i))
+ return false;
+ }
+ return true;
+}
+
+static void folio_page_zero_fill(struct folio *folio, int i)
+{
+ unsigned long *page;
+
+ page = kmap_local_folio(folio, i * PAGE_SIZE);
+ memset_l(page, 0, PAGE_SIZE / sizeof(unsigned long));
+ kunmap_local(page);
+}
+
+static void folio_zero_fill(struct folio *folio)
+{
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++)
+ folio_page_zero_fill(folio, i);
+}
+
+static void swap_zeromap_folio_set(struct folio *folio)
+{
+ struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_entry_t entry;
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ entry = page_swap_entry(folio_page(folio, i));
+ bitmap_set(sis->zeromap, swp_offset(entry), 1);
+ }
+}
+
+static bool swap_zeromap_folio_test(struct folio *folio)
+{
+ struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_entry_t entry;
+ unsigned int i;
+
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ entry = page_swap_entry(folio_page(folio, i));
+ if (!test_bit(swp_offset(entry), sis->zeromap))
+ return false;
+ }
+ return true;
+}
+
/*
* We may have stale swap cache pages in memory: notice
* them here and get rid of the unnecessary final write.
@@ -195,6 +266,14 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
folio_unlock(folio);
return ret;
}
+
+ if (is_folio_zero_filled(folio)) {
+ swap_zeromap_folio_set(folio);
+ folio_start_writeback(folio);
+ folio_unlock(folio);
+ folio_end_writeback(folio);
+ return 0;
+ }
if (zswap_store(folio)) {
folio_start_writeback(folio);
folio_unlock(folio);
@@ -515,8 +594,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
psi_memstall_enter(&pflags);
}
delayacct_swapin_start();
-
- if (zswap_load(folio)) {
+ if (swap_zeromap_folio_test(folio)) {
+ folio_zero_fill(folio);
+ folio_mark_uptodate(folio);
+ folio_unlock(folio);
+ } else if (zswap_load(folio)) {
folio_mark_uptodate(folio);
folio_unlock(folio);
} else if (data_race(sis->flags & SWP_FS_OPS)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f1e559e216bd..3f00a1cce5ba 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -461,6 +461,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
*/
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
+ bitmap_clear(si->zeromap, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER);
cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
@@ -498,6 +499,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
__free_cluster(si, idx);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
+ bitmap_clear(si->zeromap, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER);
unlock_cluster(ci);
}
}
@@ -1336,6 +1338,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
+ bitmap_clear(p->zeromap, offset, 1);
dec_cluster_info_page(p, p->cluster_info, offset);
unlock_cluster(ci);
@@ -2597,6 +2600,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
free_percpu(p->cluster_next_cpu);
p->cluster_next_cpu = NULL;
vfree(swap_map);
+ bitmap_free(p->zeromap);
kvfree(cluster_info);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
@@ -3123,6 +3127,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
+ p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
+ if (!p->zeromap) {
+ error = -ENOMEM;
+ goto bad_swap_unlock_inode;
+ }
+
if (p->bdev && bdev_stable_writes(p->bdev))
p->flags |= SWP_STABLE_WRITES;
--
2.43.0
* [PATCH 2/2] mm: remove code to handle same filled pages
2024-05-30 10:19 [PATCH 0/2] mm: store zero pages to be swapped out in a bitmap Usama Arif
2024-05-30 10:19 ` [PATCH 1/2] " Usama Arif
@ 2024-05-30 10:19 ` Usama Arif
1 sibling, 0 replies; 12+ messages in thread
From: Usama Arif @ 2024-05-30 10:19 UTC (permalink / raw)
To: akpm
Cc: hannes, yosryahmed, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Usama Arif
With an earlier commit handling zero-filled pages directly in swap,
and with less than 1% of same-filled pages being non-zero, zswap no
longer needs to handle same-filled pages and can work on compressed
pages only.
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
mm/zswap.c | 79 +++++-------------------------------------------------
1 file changed, 7 insertions(+), 72 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9b..50c8d402516f 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -44,8 +44,6 @@
**********************************/
/* The number of compressed pages currently stored in zswap */
atomic_t zswap_stored_pages = ATOMIC_INIT(0);
-/* The number of same-value filled pages currently stored in zswap */
-static atomic_t zswap_same_filled_pages = ATOMIC_INIT(0);
/*
* The statistics below are not protected from concurrent access for
@@ -182,11 +180,9 @@ static struct shrinker *zswap_shrinker;
*
* swpentry - associated swap entry, the offset indexes into the red-black tree
* length - the length in bytes of the compressed page data. Needed during
- * decompression. For a same value filled page length is 0, and both
- * pool and lru are invalid and must be ignored.
+ * decompression.
* pool - the zswap_pool the entry's data is in
* handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
* objcg - the obj_cgroup that the compressed memory is charged to
* lru - handle to the pool's lru used to evict pages.
*/
@@ -194,10 +190,7 @@ struct zswap_entry {
swp_entry_t swpentry;
unsigned int length;
struct zswap_pool *pool;
- union {
- unsigned long handle;
- unsigned long value;
- };
+ unsigned long handle;
struct obj_cgroup *objcg;
struct list_head lru;
};
@@ -814,13 +807,9 @@ static struct zpool *zswap_find_zpool(struct zswap_entry *entry)
*/
static void zswap_entry_free(struct zswap_entry *entry)
{
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zswap_lru_del(&zswap_list_lru, entry);
- zpool_free(zswap_find_zpool(entry), entry->handle);
- zswap_pool_put(entry->pool);
- }
+ zswap_lru_del(&zswap_list_lru, entry);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
+ zswap_pool_put(entry->pool);
if (entry->objcg) {
obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
obj_cgroup_put(entry->objcg);
@@ -1262,11 +1251,6 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
* This ensures that the better zswap compresses memory, the fewer
* pages we will evict to swap (as it will otherwise incur IO for
* relatively small memory saving).
- *
- * The memory saving factor calculated here takes same-filled pages into
- * account, but those are not freeable since they almost occupy no
- * space. Hence, we may scale nr_freeable down a little bit more than we
- * should if we have a lot of same-filled pages.
*/
return mult_frac(nr_freeable, nr_backing, nr_stored);
}
@@ -1370,42 +1354,6 @@ static void shrink_worker(struct work_struct *w)
} while (zswap_total_pages() > thr);
}
-/*********************************
-* same-filled functions
-**********************************/
-static bool zswap_is_folio_same_filled(struct folio *folio, unsigned long *value)
-{
- unsigned long *data;
- unsigned long val;
- unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
- bool ret = false;
-
- data = kmap_local_folio(folio, 0);
- val = data[0];
-
- if (val != data[last_pos])
- goto out;
-
- for (pos = 1; pos < last_pos; pos++) {
- if (val != data[pos])
- goto out;
- }
-
- *value = val;
- ret = true;
-out:
- kunmap_local(data);
- return ret;
-}
-
-static void zswap_fill_folio(struct folio *folio, unsigned long value)
-{
- unsigned long *data = kmap_local_folio(folio, 0);
-
- memset_l(data, value, PAGE_SIZE / sizeof(unsigned long));
- kunmap_local(data);
-}
-
/*********************************
* main API
**********************************/
@@ -1450,13 +1398,6 @@ bool zswap_store(struct folio *folio)
goto reject;
}
- if (zswap_is_folio_same_filled(folio, &value)) {
- entry->length = 0;
- entry->value = value;
- atomic_inc(&zswap_same_filled_pages);
- goto store_entry;
- }
-
/* if entry is successfully added, it keeps the reference */
entry->pool = zswap_pool_current_get();
if (!entry->pool)
@@ -1522,13 +1463,9 @@ bool zswap_store(struct folio *folio)
return true;
store_failed:
- if (!entry->length)
- atomic_dec(&zswap_same_filled_pages);
- else {
- zpool_free(zswap_find_zpool(entry), entry->handle);
+ zpool_free(zswap_find_zpool(entry), entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
- }
+ zswap_pool_put(entry->pool);
freepage:
zswap_entry_cache_free(entry);
reject:
@@ -1682,8 +1619,6 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, NULL, &total_size_fops);
debugfs_create_atomic_t("stored_pages", 0444,
zswap_debugfs_root, &zswap_stored_pages);
- debugfs_create_atomic_t("same_filled_pages", 0444,
- zswap_debugfs_root, &zswap_same_filled_pages);
return 0;
}
--
2.43.0
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 10:19 ` [PATCH 1/2] " Usama Arif
@ 2024-05-30 12:27 ` Johannes Weiner
2024-05-30 16:24 ` Yosry Ahmed
2024-05-30 16:20 ` Yosry Ahmed
2024-05-30 19:58 ` Andrew Morton
2 siblings, 1 reply; 12+ messages in thread
From: Johannes Weiner @ 2024-05-30 12:27 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, yosryahmed, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
This is awesome.
> ---
> include/linux/swap.h | 1 +
> mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> mm/swapfile.c | 10 ++++++
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
> signed char type; /* strange name for an index */
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
overhead for configured swap space. That seems reasonable for what
appears to be a fairly universal 10% reduction in swap IO.
An alternative implementation would be to reserve a bit in
swap_map. This would be no overhead at idle, but would force
continuation counts earlier on heavily shared page tables, and AFAICS
would get complicated in terms of locking, whereas this one is pretty
simple (atomic ops protect the map, swapcache lock protects the bit).
So I prefer this version. But a few comments below:
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct swap_cluster_list free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..ab043b4ad577 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,77 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> goto out;
> }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *page;
> + unsigned int pos;
> + bool ret = false;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);
> + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) {
> + if (page[pos] != 0)
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(page);
> + return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + if (!is_folio_page_zero_filled(folio, i))
> + return false;
> + }
> + return true;
> +}
> +
> +static void folio_page_zero_fill(struct folio *folio, int i)
> +{
> + unsigned long *page;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);
> + memset_l(page, 0, PAGE_SIZE / sizeof(unsigned long));
> + kunmap_local(page);
> +}
> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++)
> + folio_page_zero_fill(folio, i);
> +}
> +
> +static void swap_zeromap_folio_set(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + bitmap_set(sis->zeromap, swp_offset(entry), 1);
This should be set_bit(). bitmap_set() isn't atomic, so it would
corrupt the map on concurrent swapping of other zero pages. And you
don't need a range op here anyway.
> + }
> +}
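For illustration, an untested sketch of the suggested change (one
atomic set_bit() per slot, no range op), not part of the posted patch:

static void swap_zeromap_folio_set(struct folio *folio)
{
	struct swap_info_struct *sis = swp_swap_info(folio->swap);
	swp_entry_t entry;
	unsigned int i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		entry = page_swap_entry(folio_page(folio, i));
		/* set_bit() is atomic, unlike bitmap_set() */
		set_bit(swp_offset(entry), sis->zeromap);
	}
}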
> +
> +static bool swap_zeromap_folio_test(struct folio *folio)
> +{
> + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> + swp_entry_t entry;
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + entry = page_swap_entry(folio_page(folio, i));
> + if (!test_bit(swp_offset(entry), sis->zeromap))
> + return false;
> + }
> + return true;
> +}
> +
> /*
> * We may have stale swap cache pages in memory: notice
> * them here and get rid of the unnecessary final write.
> @@ -195,6 +266,14 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> folio_unlock(folio);
> return ret;
> }
> +
> + if (is_folio_zero_filled(folio)) {
> + swap_zeromap_folio_set(folio);
> + folio_start_writeback(folio);
> + folio_unlock(folio);
> + folio_end_writeback(folio);
> + return 0;
> + }
You need to clear the zeromap bit in the else branch.
Userspace can change the contents of a swapcached page, which
redirties the page and forces an overwrite of the slot when the page
is reclaimed again.
So if the page goes from zeroes to something else and then gets
reclaimed again, a subsequent swapin would read the stale zeroes.
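For illustration, an untested sketch of the kind of fix meant here;
swap_zeromap_folio_clear() is a hypothetical helper, not in the posted
patch:

static void swap_zeromap_folio_clear(struct folio *folio)
{
	struct swap_info_struct *sis = swp_swap_info(folio->swap);
	swp_entry_t entry;
	unsigned int i;

	for (i = 0; i < folio_nr_pages(folio); i++) {
		entry = page_swap_entry(folio_page(folio, i));
		/* atomic per-bit clear, matching the set_bit() comment above */
		clear_bit(swp_offset(entry), sis->zeromap);
	}
}

and in swap_writepage():

	if (is_folio_zero_filled(folio)) {
		swap_zeromap_folio_set(folio);
		folio_start_writeback(folio);
		folio_unlock(folio);
		folio_end_writeback(folio);
		return 0;
	}
	/* A previously-zero page was redirtied with data: drop stale bits. */
	swap_zeromap_folio_clear(folio);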
> if (zswap_store(folio)) {
> folio_start_writeback(folio);
> folio_unlock(folio);
> @@ -515,8 +594,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
> psi_memstall_enter(&pflags);
> }
> delayacct_swapin_start();
> -
> - if (zswap_load(folio)) {
> + if (swap_zeromap_folio_test(folio)) {
> + folio_zero_fill(folio);
> + folio_mark_uptodate(folio);
> + folio_unlock(folio);
> + } else if (zswap_load(folio)) {
> folio_mark_uptodate(folio);
> folio_unlock(folio);
> } else if (data_race(sis->flags & SWP_FS_OPS)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f1e559e216bd..3f00a1cce5ba 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -461,6 +461,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> */
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> + bitmap_clear(si->zeromap, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER);
AFAICS this needs to be atomic as well. The swap_info and cluster are
locked, which stabilizes si->swap_map, but zeromap can see updates
from concurrent swap_writepage() and swap_read_folio() on other slots.
I think you need to use a loop over clear_bit(). Please also add a
comment with the above.
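For illustration, an untested sketch of that, with the rationale
recorded in a comment:

	unsigned int offset;

	/*
	 * si->swap_map is stable because the swap_info and cluster are
	 * locked, but si->zeromap is also updated by swap_writepage() on
	 * other slots without those locks, so use atomic per-bit clears
	 * instead of the non-atomic bitmap_clear().
	 */
	for (offset = idx * SWAPFILE_CLUSTER;
	     offset < (idx + 1) * SWAPFILE_CLUSTER; offset++)
		clear_bit(offset, si->zeromap);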
>
> cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>
> @@ -498,6 +499,7 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> __free_cluster(si, idx);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> + bitmap_clear(si->zeromap, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER);
Same.
> unlock_cluster(ci);
> }
> }
> @@ -1336,6 +1338,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> + bitmap_clear(p->zeromap, offset, 1);
This too needs to be atomic, IOW clear_bit().
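For illustration (untested), the substitution here would be just:

	p->swap_map[offset] = 0;
	/* clear_bit() is atomic; bitmap_clear() is not */
	clear_bit(offset, p->zeromap);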
Otherwise this looks good to me.
> dec_cluster_info_page(p, p->cluster_info, offset);
> unlock_cluster(ci);
>
> @@ -2597,6 +2600,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> free_percpu(p->cluster_next_cpu);
> p->cluster_next_cpu = NULL;
> vfree(swap_map);
> + bitmap_free(p->zeromap);
> kvfree(cluster_info);
> /* Destroy swap account information */
> swap_cgroup_swapoff(p->type);
> @@ -3123,6 +3127,12 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto bad_swap_unlock_inode;
> }
>
> + p->zeromap = bitmap_zalloc(maxpages, GFP_KERNEL);
> + if (!p->zeromap) {
> + error = -ENOMEM;
> + goto bad_swap_unlock_inode;
> + }
> +
> if (p->bdev && bdev_stable_writes(p->bdev))
> p->flags |= SWP_STABLE_WRITES;
>
> --
> 2.43.0
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 10:19 ` [PATCH 1/2] " Usama Arif
2024-05-30 12:27 ` Johannes Weiner
@ 2024-05-30 16:20 ` Yosry Ahmed
2024-05-30 19:58 ` Andrew Morton
2 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2024-05-30 16:20 UTC (permalink / raw)
To: Usama Arif
Cc: akpm, hannes, nphamcs, chengming.zhou, linux-mm, linux-kernel,
kernel-team, Ying
On Thu, May 30, 2024 at 3:21 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
Great work. I thought about doing this after my attempt to drop some
of the same-filled pages handling in zswap; now we can drop all of it
:)

Make sure to CC other non-zswap folks, like Ying, on the next
iterations. I see Johannes already did that in his response. I suspect
get_maintainers will give you a few extra names.
>
> [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
> include/linux/swap.h | 1 +
> mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> mm/swapfile.c | 10 ++++++
> 3 files changed, 95 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
> signed char type; /* strange name for an index */
> unsigned int max; /* extent of the swap_map */
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct swap_cluster_list free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..ab043b4ad577 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,77 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> goto out;
> }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *page;
I just recently renamed this variable in the zswap version of this
function because it is very confusing, especially when looking for
struct page references. 'page' is usually used for struct page. Let's
use a different name here.
> + unsigned int pos;
> + bool ret = false;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);
In the zswap version, we compare against the end of the page first
before the loop, in case the page just has a bunch of zeros at its
beginning. The patch that added it to zswap reported better
performance in some cases [1].
You can also use memchr_inv() to compare the range against 0 instead
of the loop.
[1] https://lore.kernel.org/all/20230205190036.1730134-1-taejoon.song@lge.com/
> + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) {
> + if (page[pos] != 0)
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(page);
> + return ret;
> +}
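For illustration, an untested sketch of the memchr_inv() variant
suggested above (it does not include the check-the-last-word-first
trick, and uses 'addr' rather than 'page' for the mapping):

static bool is_folio_page_zero_filled(struct folio *folio, int i)
{
	void *addr;
	bool ret;

	addr = kmap_local_folio(folio, i * PAGE_SIZE);
	/* memchr_inv() returns NULL if the whole range is the given byte */
	ret = !memchr_inv(addr, 0, PAGE_SIZE);
	kunmap_local(addr);
	return ret;
}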
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++) {
> + if (!is_folio_page_zero_filled(folio, i))
> + return false;
> + }
> + return true;
> +}
Is there value in splitting this into two functions and having a
separate loop here? Can we just have a single function that kmaps the
entire folio and loops over it in one go?
> +
> +static void folio_page_zero_fill(struct folio *folio, int i)
> +{
> + unsigned long *page;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);
> + memset_l(page, 0, PAGE_SIZE / sizeof(unsigned long));
> + kunmap_local(page);
> +}
> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> + unsigned int i;
> +
> + for (i = 0; i < folio_nr_pages(folio); i++)
> + folio_page_zero_fill(folio, i);
I think you can just use clear_highpage() here and drop
folio_page_zero_fill(). It should be more optimized as well in some
cases.
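For illustration, an untested sketch of that simplification:

static void folio_zero_fill(struct folio *folio)
{
	unsigned int i;

	/* clear_highpage() does the kmap_local/clear_page/kunmap itself */
	for (i = 0; i < folio_nr_pages(folio); i++)
		clear_highpage(folio_page(folio, i));
}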
I don't have further comments about the rest beyond Johannes's comments.
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 12:27 ` Johannes Weiner
@ 2024-05-30 16:24 ` Yosry Ahmed
2024-05-30 19:18 ` Nhat Pham
2024-05-30 20:04 ` Matthew Wilcox
0 siblings, 2 replies; 12+ messages in thread
From: Yosry Ahmed @ 2024-05-30 16:24 UTC (permalink / raw)
To: Johannes Weiner
Cc: Usama Arif, akpm, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On Thu, May 30, 2024 at 5:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> > Approximately 10-20% of pages to be swapped out are zero pages [1].
> > Rather than reading/writing these pages to flash resulting
> > in increased I/O and flash wear, a bitmap can be used to mark these
> > pages as zero at write time, and the pages can be filled at
> > read time if the bit corresponding to the page is set.
> > With this patch, NVMe writes in Meta server fleet decreased
> > by almost 10% with conventional swap setup (zswap disabled).
> >
> > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> >
> > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>
> This is awesome.
>
> > ---
> > include/linux/swap.h | 1 +
> > mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> > mm/swapfile.c | 10 ++++++
> > 3 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index a11c75e897ec..e88563978441 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -299,6 +299,7 @@ struct swap_info_struct {
> > signed char type; /* strange name for an index */
> > unsigned int max; /* extent of the swap_map */
> > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
>
> One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
> overhead for configured swap space. That seems reasonable for what
> appears to be a fairly universal 10% reduction in swap IO.
>
> An alternative implementation would be to reserve a bit in
> swap_map. This would be no overhead at idle, but would force
> continuation counts earlier on heavily shared page tables, and AFAICS
> would get complicated in terms of locking, whereas this one is pretty
> simple (atomic ops protect the map, swapcache lock protects the bit).
>
> So I prefer this version. But a few comments below:
I am wondering if it's even possible to take this one step further and
avoid reclaiming zero-filled pages in the first place. Can we just
unmap them and let the first read fault allocate a zero'd page like
uninitialized memory, or point them at the zero page and make them
read-only, or something? Then we could free them directly without
going into the swap code to begin with.
That's how I thought about it initially when I attempted to support
only zero-filled pages in zswap. It could be a more complex
implementation though.
[..]
> > +
> > +static void swap_zeromap_folio_set(struct folio *folio)
> > +{
> > + struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > + swp_entry_t entry;
> > + unsigned int i;
> > +
> > + for (i = 0; i < folio_nr_pages(folio); i++) {
> > + entry = page_swap_entry(folio_page(folio, i));
> > + bitmap_set(sis->zeromap, swp_offset(entry), 1);
>
> This should be set_bit(). bitmap_set() isn't atomic, so it would
> corrupt the map on concurrent swapping of other zero pages. And you
> don't need a range op here anyway.
It's a shame there is no range version of set_bit(). I suspect we can
save a few atomic operations on large folios if we write them in
chunks rather than one by one.
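For illustration, a very rough, untested sketch of what a chunked
helper could look like; this is not an existing kernel API, and it
assumes concurrent updaters only touch disjoint slot ranges (as they
do for sis->zeromap), so one atomic OR per bitmap word is enough:

static void zeromap_set_range(unsigned long *map, unsigned long start,
			      unsigned long nr)
{
	while (nr) {
		unsigned long bit = start % BITS_PER_LONG;
		unsigned long span = min(nr, BITS_PER_LONG - bit);
		unsigned long mask = span == BITS_PER_LONG ?
				     ~0UL : ((1UL << span) - 1) << bit;

		/* atomic OR of the whole mask into one bitmap word */
		atomic_long_or(mask, (atomic_long_t *)&map[start / BITS_PER_LONG]);
		start += span;
		nr -= span;
	}
}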
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 16:24 ` Yosry Ahmed
@ 2024-05-30 19:18 ` Nhat Pham
2024-05-30 19:49 ` Yosry Ahmed
2024-05-30 20:04 ` Matthew Wilcox
1 sibling, 1 reply; 12+ messages in thread
From: Nhat Pham @ 2024-05-30 19:18 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Johannes Weiner, Usama Arif, akpm, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On Thu, May 30, 2024 at 9:24 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, May 30, 2024 at 5:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> > > Approximately 10-20% of pages to be swapped out are zero pages [1].
> > > Rather than reading/writing these pages to flash resulting
> > > in increased I/O and flash wear, a bitmap can be used to mark these
> > > pages as zero at write time, and the pages can be filled at
> > > read time if the bit corresponding to the page is set.
> > > With this patch, NVMe writes in Meta server fleet decreased
> > > by almost 10% with conventional swap setup (zswap disabled).
> > >
> > > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> > >
> > > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > This is awesome.
> >
> > > ---
> > > include/linux/swap.h | 1 +
> > > mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> > > mm/swapfile.c | 10 ++++++
> > > 3 files changed, 95 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index a11c75e897ec..e88563978441 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -299,6 +299,7 @@ struct swap_info_struct {
> > > signed char type; /* strange name for an index */
> > > unsigned int max; /* extent of the swap_map */
> > > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > > + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> >
> > One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
> > overhead for configured swap space. That seems reasonable for what
> > appears to be a fairly universal 10% reduction in swap IO.
> >
> > An alternative implementation would be to reserve a bit in
> > swap_map. This would be no overhead at idle, but would force
> > continuation counts earlier on heavily shared page tables, and AFAICS
> > would get complicated in terms of locking, whereas this one is pretty
> > simple (atomic ops protect the map, swapcache lock protects the bit).
> >
> > So I prefer this version. But a few comments below:
>
> I am wondering if it's even possible to take this one step further and
> avoid reclaiming zero-filled pages in the first place. Can we just
> unmap them and let the first read fault allocate a zero'd page like
> uninitialized memory, or point them at the zero page and make them
> read-only, or something? Then we could free them directly without
> going into the swap code to begin with.
>
> That's how I thought about it initially when I attempted to support
> only zero-filled pages in zswap. It could be a more complex
> implementation though.
We can aim for this eventually, but yeah the implementation will be
more complex. We'll need to be careful in handling shared zero pages,
synchronizing accesses and maintaining reference counts. I think we
will need to special-case swap cache and swap map for these zero pages
(a ghost zero swap device perhaps), or reinvent the wheel to manage
these pieces of information.
Not impossible, but annoying :) For now, I think Usama's approach is
clean enough and does the job.
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 19:18 ` Nhat Pham
@ 2024-05-30 19:49 ` Yosry Ahmed
0 siblings, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2024-05-30 19:49 UTC (permalink / raw)
To: Nhat Pham
Cc: Johannes Weiner, Usama Arif, akpm, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On Thu, May 30, 2024 at 12:18 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, May 30, 2024 at 9:24 AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, May 30, 2024 at 5:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> > > > Approximately 10-20% of pages to be swapped out are zero pages [1].
> > > > Rather than reading/writing these pages to flash resulting
> > > > in increased I/O and flash wear, a bitmap can be used to mark these
> > > > pages as zero at write time, and the pages can be filled at
> > > > read time if the bit corresponding to the page is set.
> > > > With this patch, NVMe writes in Meta server fleet decreased
> > > > by almost 10% with conventional swap setup (zswap disabled).
> > > >
> > > > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> > > >
> > > > Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> > >
> > > This is awesome.
> > >
> > > > ---
> > > > include/linux/swap.h | 1 +
> > > > mm/page_io.c | 86 ++++++++++++++++++++++++++++++++++++++++++--
> > > > mm/swapfile.c | 10 ++++++
> > > > 3 files changed, 95 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index a11c75e897ec..e88563978441 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -299,6 +299,7 @@ struct swap_info_struct {
> > > > signed char type; /* strange name for an index */
> > > > unsigned int max; /* extent of the swap_map */
> > > > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > > > + unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> > >
> > > One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
> > > overhead for configured swap space. That seems reasonable for what
> > > appears to be a fairly universal 10% reduction in swap IO.
> > >
> > > An alternative implementation would be to reserve a bit in
> > > swap_map. This would be no overhead at idle, but would force
> > > continuation counts earlier on heavily shared page tables, and AFAICS
> > > would get complicated in terms of locking, whereas this one is pretty
> > > simple (atomic ops protect the map, swapcache lock protects the bit).
> > >
> > > So I prefer this version. But a few comments below:
> >
> > I am wondering if it's even possible to take this one step further and
> > avoid reclaiming zero-filled pages in the first place. Can we just
> > unmap them and let the first read fault allocate a zero'd page like
> > uninitialized memory, or point them at the zero page and make them
> > read-only, or something? Then we could free them directly without
> > going into the swap code to begin with.
> >
> > That's how I thought about it initially when I attempted to support
> > only zero-filled pages in zswap. It could be a more complex
> > implementation though.
>
> We can aim for this eventually, but yeah the implementation will be
> more complex. We'll need to be careful in handling shared zero pages,
> synchronizing accesses and maintaining reference counts. I think we
> will need to special-case swap cache and swap map for these zero pages
> (a ghost zero swap device perhaps), or reinvent the wheel to manage
> these pieces of information.
Isn't there an existing mechanism to have read-only mappings pointing
at the shared zero page, and do COW? Can't we just use that?
I think this is already what we do for mapped areas that were never
written in some cases (see do_anonymous_page()), so it would be just
like that (i.e. as if the mappings were never written). Someone with
more familiarity with this would know better though.
>
> Not impossible, but annoying :) For now, I think Usama's approach is
> clean enough and does the job.
Yeah, I am not against Usama's approach at all. I just want us to
consider both options before we commit to one. If they are close
enough in complexity, it may be worth avoiding swap completely.
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 10:19 ` [PATCH 1/2] " Usama Arif
2024-05-30 12:27 ` Johannes Weiner
2024-05-30 16:20 ` Yosry Ahmed
@ 2024-05-30 19:58 ` Andrew Morton
2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2024-05-30 19:58 UTC (permalink / raw)
To: Usama Arif
Cc: hannes, yosryahmed, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team
On Thu, 30 May 2024 11:19:07 +0100 Usama Arif <usamaarif642@gmail.com> wrote:
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
A little nitlet as you'll be altering the code...
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,77 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
> goto out;
> }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> + unsigned long *page;
> + unsigned int pos;
> + bool ret = false;
> +
> + page = kmap_local_folio(folio, i * PAGE_SIZE);
It's rather expected that a local variable called `page' has type
`struct page *'. Can we use something like `addr' here?
> + for (pos = 0; pos < PAGE_SIZE / sizeof(*page); pos++) {
> + if (page[pos] != 0)
> + goto out;
> + }
> + ret = true;
> +out:
> + kunmap_local(page);
> + return ret;
> +}
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 16:24 ` Yosry Ahmed
2024-05-30 19:18 ` Nhat Pham
@ 2024-05-30 20:04 ` Matthew Wilcox
2024-05-30 20:16 ` Yosry Ahmed
2024-05-31 18:18 ` Usama Arif
1 sibling, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2024-05-30 20:04 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Johannes Weiner, Usama Arif, akpm, nphamcs, chengming.zhou,
linux-mm, linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On Thu, May 30, 2024 at 09:24:20AM -0700, Yosry Ahmed wrote:
> I am wondering if it's even possible to take this one step further and
> avoid reclaiming zero-filled pages in the first place. Can we just
> unmap them and let the first read fault allocate a zero'd page like
> uninitialized memory, or point them at the zero page and make them
> read-only, or something? Then we could free them directly without
> going into the swap code to begin with.
I was having similar thoughts. You can see in do_anonymous_page() that
we simply map the shared zero page when we take a read fault on
unallocated anon memory.
So my question is where are all these zero pages coming from in the Meta
fleet? Obviously we never try to swap out the shared zero page (it's
not on any LRU list). So I see three possibilities:
- Userspace wrote to it, but it wrote zeroes. Then we did a memcmp(),
discovered it was zeroes and fall into this path. It would be safe
to just discard this page.
- We allocated it as part of a THP. We never wrote to this particular
page of the THP, so it's zero-filled. While it's safe to just
discard this page, we might want to write it for better swap-in
performance.
- Userspace wrote non-zeroes to it, then wrote zeroes to it before
abandoning use of this page, and so it eventually got swapped out.
Perhaps we could teach userspace to MADV_DONTNEED the page instead?
Has any data been gathered on this? Maybe there are other sources of
zeroed pages that I'm missing. I do remember a presentation at LSFMM
in 2022 from Google about very sparsely used THPs.
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 20:04 ` Matthew Wilcox
@ 2024-05-30 20:16 ` Yosry Ahmed
2024-05-31 18:18 ` Usama Arif
1 sibling, 0 replies; 12+ messages in thread
From: Yosry Ahmed @ 2024-05-30 20:16 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Johannes Weiner, Usama Arif, akpm, nphamcs, chengming.zhou,
linux-mm, linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On Thu, May 30, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, May 30, 2024 at 09:24:20AM -0700, Yosry Ahmed wrote:
> > I am wondering if it's even possible to take this one step further and
> > avoid reclaiming zero-filled pages in the first place. Can we just
> > unmap them and let the first read fault allocate a zero'd page like
> > uninitialized memory, or point them at the zero page and make them
> > read-only, or something? Then we could free them directly without
> > going into the swap code to begin with.
>
> I was having similar thoughts. You can see in do_anonymous_page() that
> we simply map the shared zero page when we take a read fault on
> unallocated anon memory.
>
> So my question is where are all these zero pages coming from in the Meta
> fleet? Obviously we never try to swap out the shared zero page (it's
> not on any LRU list). So I see three possibilities:
>
> - Userspace wrote to it, but it wrote zeroes. Then we did a memcmp(),
> discovered it was zeroes and fall into this path. It would be safe
> to just discard this page.
> - We allocated it as part of a THP. We never wrote to this particular
> page of the THP, so it's zero-filled. While it's safe to just
> discard this page, we might want to write it for better swap-in
> performance.
My understanding is that here we check if the entire folio is
zero-filled. If the THP is still intact as a single folio, we will
only apply the optimization if the entire THP is zero-filled. If we
are checking a page that used to be part of a THP, then I think the
THP is already split and swap-in performance would not be affected.
Did I miss anything here?
> - Userspace wrote non-zeroes to it, then wrote zeroes to it before
> abandoning use of this page, and so it eventually got swapped out.
> Perhaps we could teach userspace to MADV_DONTNEED the page instead?
Why wouldn't it be safe to discard the page in this case as well?
>
> Has any data been gathered on this? Maybe there are other sources of
> zeroed pages that I'm missing. I do remember a presentation at LSFMM
> in 2022 from Google about very sparsely used THPs.
Apart from that, we may also want to think about shmem if we want a
general approach to avoid swapping out zero pages.
* Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
2024-05-30 20:04 ` Matthew Wilcox
2024-05-30 20:16 ` Yosry Ahmed
@ 2024-05-31 18:18 ` Usama Arif
1 sibling, 0 replies; 12+ messages in thread
From: Usama Arif @ 2024-05-31 18:18 UTC (permalink / raw)
To: Matthew Wilcox, Yosry Ahmed
Cc: Johannes Weiner, akpm, nphamcs, chengming.zhou, linux-mm,
linux-kernel, kernel-team, Hugh Dickins, Huang Ying
On 30/05/2024 21:04, Matthew Wilcox wrote:
> On Thu, May 30, 2024 at 09:24:20AM -0700, Yosry Ahmed wrote:
>> I am wondering if it's even possible to take this one step further and
>> avoid reclaiming zero-filled pages in the first place. Can we just
>> unmap them and let the first read fault allocate a zero'd page like
>> uninitialized memory, or point them at the zero page and make them
>> read-only, or something? Then we could free them directly without
>> going into the swap code to begin with.
> I was having similar thoughts. You can see in do_anonymous_page() that
> we simply map the shared zero page when we take a read fault on
> unallocated anon memory.
Thanks Yosry and Matthew. Currently trying to prototype and see how this
might look. Hopefully should have an update next week.
> So my question is where are all these zero pages coming from in the Meta
> fleet? Obviously we never try to swap out the shared zero page (it's
> not on any LRU list). So I see three possibilities:
>
> - Userspace wrote to it, but it wrote zeroes. Then we did a memcmp(),
> discovered it was zeroes and fall into this path. It would be safe
> to just discard this page.
> - We allocated it as part of a THP. We never wrote to this particular
> page of the THP, so it's zero-filled. While it's safe to just
> discard this page, we might want to write it for better swap-in
> performance.
It's mostly THP. Alex presented the numbers well in his THP series:
https://lore.kernel.org/lkml/cover.1661461643.git.alexlzhu@fb.com/
> - Userspace wrote non-zeroes to it, then wrote zeroes to it before
> abandoning use of this page, and so it eventually got swapped out.
> Perhaps we could teach userspace to MADV_DONTNEED the page instead?
>
> Has any data been gathered on this? Maybe there are other sources of
> zeroed pages that I'm missing. I do remember a presentation at LSFMM
> in 2022 from Google about very sparsely used THPs.