From: Nhat Pham
Date: Tue, 11 Jun 2024 11:39:08 -0700
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap
To: Usama Arif
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, david@redhat.com, ying.huang@intel.com, hughd@google.com, willy@infradead.org, yosryahmed@google.com, chengming.zhou@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com
In-Reply-To: <20240610121820.328876-2-usamaarif642@gmail.com>
References: <20240610121820.328876-1-usamaarif642@gmail.com> <20240610121820.328876-2-usamaarif642@gmail.com>
On Mon, Jun 10, 2024 at 5:18 AM Usama Arif wrote:
>
> Approximately 10-20% of pages to be swapped out are zero pages [1].
> Rather than reading/writing these pages to flash, resulting
> in increased I/O and flash wear, a bitmap can be used to mark these
> pages as zero at write time, and the pages can be filled at
> read time if the bit corresponding to the page is set.
> With this patch, NVMe writes in Meta server fleet decreased
> by almost 10% with conventional swap setup (zswap disabled).
>
> [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>
> Signed-off-by: Usama Arif
> ---
>  include/linux/swap.h |  1 +
>  mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>  mm/swapfile.c        | 21 +++++++++-
>  3 files changed, 111 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a11c75e897ec..e88563978441 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -299,6 +299,7 @@ struct swap_info_struct {
>  	signed char type;		/* strange name for an index */
>  	unsigned int max;		/* extent of the swap_map */
>  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
> +	unsigned long *zeromap;		/* vmalloc'ed bitmap to track zero pages */
>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>  	struct swap_cluster_list free_clusters; /* free clusters list */
>  	unsigned int lowest_bit;	/* index of first free in swap_map */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a360857cf75d..2cac1e11fb85 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>  	goto out;
>  }
>
> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
> +{
> +	unsigned long *data;
> +	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
> +	bool ret = false;
> +
> +	data = kmap_local_folio(folio, i * PAGE_SIZE);
> +	if (data[last_pos])
> +		goto out;
> +	for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
> +		if (data[pos])
> +			goto out;
> +	}
> +	ret = true;
> +out:
> +	kunmap_local(data);
> +	return ret;
> +}
> +
> +static bool is_folio_zero_filled(struct folio *folio)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < folio_nr_pages(folio); i++) {
> +		if (!is_folio_page_zero_filled(folio, i))
> +			return false;
> +	}
> +	return true;
> +}
> +
> +static void folio_zero_fill(struct folio *folio)
> +{
> +	unsigned int i;
> +
> +	for (i = 0; i < folio_nr_pages(folio); i++)
> +		clear_highpage(folio_page(folio, i));
> +}
> +
> +static void swap_zeromap_folio_set(struct folio *folio)
> +{
> +	struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +	swp_entry_t entry;
> +	unsigned int i;
> +
> +	for (i = 0; i < folio_nr_pages(folio); i++) {
> +		entry = page_swap_entry(folio_page(folio, i));
> +		set_bit(swp_offset(entry), sis->zeromap);
> +	}
> +}
> +
> +static void swap_zeromap_folio_clear(struct folio *folio)
> +{
> +	struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +	swp_entry_t entry;
> +	unsigned int i;
> +
> +	for (i = 0; i < folio_nr_pages(folio); i++) {
> +		entry = page_swap_entry(folio_page(folio, i));
> +		clear_bit(swp_offset(entry), sis->zeromap);
> +	}
> +}
> +
> +static bool swap_zeromap_folio_test(struct folio *folio)
> +{
> +	struct swap_info_struct *sis = swp_swap_info(folio->swap);
> +	swp_entry_t entry;
> +	unsigned int i;
> +
> +	for (i = 0; i < folio_nr_pages(folio); i++) {
> +		entry = page_swap_entry(folio_page(folio, i));
> +		if (!test_bit(swp_offset(entry), sis->zeromap))
> +			return false;
> +	}
> +	return true;
> +}
> +
>  /*
>   * We may have stale swap cache pages in memory: notice
>   * them here and get rid of the unnecessary final write.
> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  		folio_unlock(folio);
>  		return ret;
>  	}
> +
> +	if (is_folio_zero_filled(folio)) {
> +		swap_zeromap_folio_set(folio);
> +		folio_start_writeback(folio);
> +		folio_unlock(folio);
> +		folio_end_writeback(folio);
> +		return 0;
> +	}
> +	swap_zeromap_folio_clear(folio);
>  	if (zswap_store(folio)) {
>  		folio_start_writeback(folio);
>  		folio_unlock(folio);
> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>  		psi_memstall_enter(&pflags);
>  	}
>  	delayacct_swapin_start();
> -
> -	if (zswap_load(folio)) {
> +	if (swap_zeromap_folio_test(folio)) {
> +		folio_zero_fill(folio);
> +		folio_mark_uptodate(folio);
> +		folio_unlock(folio);
> +	} else if (zswap_load(folio)) {
>  		folio_mark_uptodate(folio);
>  		folio_unlock(folio);
>  	} else if (data_race(sis->flags & SWP_FS_OPS)) {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f1e559e216bd..90451174fe34 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  		unsigned int idx)
>  {
> +	unsigned int i;
> +
>  	/*
>  	 * If scan_swap_map_slots() can't find a free cluster, it will check
>  	 * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  	 */
>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> +	/*
> +	 * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
> +	 * calls on other slots, hence use atomic clear_bit for zeromap instead of the
> +	 * non-atomic bitmap_clear.
> +	 */
> +	for (i = 0; i < SWAPFILE_CLUSTER; i++)
> +		clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>
>  	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>
> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
>  	struct swap_cluster_info *info, *ci;
> -	unsigned int idx;
> +	unsigned int idx, i;
>
>  	info = si->cluster_info;
>
> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  		__free_cluster(si, idx);
>  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  				0, SWAPFILE_CLUSTER);
> +		for (i = 0; i < SWAPFILE_CLUSTER; i++)
> +			clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>  		unlock_cluster(ci);
>  	}
>  }
> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>  	count = p->swap_map[offset];
>  	VM_BUG_ON(count != SWAP_HAS_CACHE);
>  	p->swap_map[offset] = 0;
> +	clear_bit(offset, p->zeromap);

Hmm, so clear_bit() is done at the swap_entry_free() point. I wonder if
we can have a problem, where:

1. The swap entry has its zeromap bit set, and is freed to the swap
slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
reclaimed from the swap cache, and all the processes referring to it
are terminated, which decrements the swap count to 0 (swap_free() ->
__swap_entry_free() -> free_swap_slots()).

2. The swap slot is then re-used in swap space allocation
(add_to_swap()) - its zeromap bit is never cleared.

3. swap_writepage() writes that non-zero page to swap.

4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
the entry is set, so populates a zero page for it.

zswap in the past had to invalidate these leftover entries quite
carefully. Chengming then moved the invalidation point to
free_swap_slot(), massively simplifying the logic. I wonder if we need
to do the same here?