From: Usama Arif <usamaarif642@gmail.com>
Date: Tue, 11 Jun 2024 19:50:11 +0100
Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap
To: Nhat Pham
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, david@redhat.com,
 ying.huang@intel.com, hughd@google.com, willy@infradead.org,
 yosryahmed@google.com, chengming.zhou@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, kernel-team@meta.com, Shakeel Butt
References: <20240610121820.328876-1-usamaarif642@gmail.com>
 <20240610121820.328876-2-usamaarif642@gmail.com>

On 11/06/2024 19:39, Nhat Pham wrote:
> On Mon, Jun 10, 2024 at 5:18 AM Usama Arif wrote:
>> Approximately 10-20% of pages to be swapped out are zero pages [1].
>> Rather than reading/writing these pages to flash resulting
>> in increased I/O and flash wear, a bitmap can be used to mark these
>> pages as zero at write time, and the pages can be filled at
>> read time if the bit corresponding to the page is set.
>> With this patch, NVMe writes in Meta server fleet decreased
>> by almost 10% with conventional swap setup (zswap disabled).
>>
>> [1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/swap.h |  1 +
>>  mm/page_io.c         | 92 +++++++++++++++++++++++++++++++++++++++++++-
>>  mm/swapfile.c        | 21 +++++++++-
>>  3 files changed, 111 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index a11c75e897ec..e88563978441 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -299,6 +299,7 @@ struct swap_info_struct {
>>  	signed char type;		/* strange name for an index */
>>  	unsigned int max;		/* extent of the swap_map */
>>  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
>> +	unsigned long *zeromap;		/* vmalloc'ed bitmap to track zero pages */
>>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>  	struct swap_cluster_list free_clusters; /* free clusters list */
>>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>> diff --git a/mm/page_io.c b/mm/page_io.c
>> index a360857cf75d..2cac1e11fb85 100644
>> --- a/mm/page_io.c
>> +++ b/mm/page_io.c
>> @@ -172,6 +172,82 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>>  	goto out;
>>  }
>>
>> +static bool is_folio_page_zero_filled(struct folio *folio, int i)
>> +{
>> +	unsigned long *data;
>> +	unsigned int pos, last_pos = PAGE_SIZE / sizeof(*data) - 1;
>> +	bool ret = false;
>> +
>> +	data = kmap_local_folio(folio, i * PAGE_SIZE);
>> +	if (data[last_pos])
>> +		goto out;
>> +	for (pos = 0; pos < PAGE_SIZE / sizeof(*data); pos++) {
>> +		if (data[pos])
>> +			goto out;
>> +	}
>> +	ret = true;
>> +out:
>> +	kunmap_local(data);
>> +	return ret;
>> +}
>> +
>> +static bool is_folio_zero_filled(struct folio *folio)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < folio_nr_pages(folio); i++) {
>> +		if (!is_folio_page_zero_filled(folio, i))
>> +			return false;
>> +	}
>> +	return true;
>> +}
>> +
>> +static void folio_zero_fill(struct folio *folio)
>> +{
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < folio_nr_pages(folio); i++)
>> +		clear_highpage(folio_page(folio, i));
>> +}
>> +
>> +static void swap_zeromap_folio_set(struct folio *folio)
>> +{
>> +	struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +	swp_entry_t entry;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < folio_nr_pages(folio); i++) {
>> +		entry = page_swap_entry(folio_page(folio, i));
>> +		set_bit(swp_offset(entry), sis->zeromap);
>> +	}
>> +}
>> +
>> +static void swap_zeromap_folio_clear(struct folio *folio)
>> +{
>> +	struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +	swp_entry_t entry;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < folio_nr_pages(folio); i++) {
>> +		entry = page_swap_entry(folio_page(folio, i));
>> +		clear_bit(swp_offset(entry), sis->zeromap);
>> +	}
>> +}
>> +
>> +static bool swap_zeromap_folio_test(struct folio *folio)
>> +{
>> +	struct swap_info_struct *sis = swp_swap_info(folio->swap);
>> +	swp_entry_t entry;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < folio_nr_pages(folio); i++) {
>> +		entry = page_swap_entry(folio_page(folio, i));
>> +		if (!test_bit(swp_offset(entry), sis->zeromap))
>> +			return false;
>> +	}
>> +	return true;
>> +}
>> +
>>  /*
>>   * We may have stale swap cache pages in memory: notice
>>   * them here and get rid of the unnecessary final write.
>> @@ -195,6 +271,15 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>>  		folio_unlock(folio);
>>  		return ret;
>>  	}
>> +
>> +	if (is_folio_zero_filled(folio)) {
>> +		swap_zeromap_folio_set(folio);
>> +		folio_start_writeback(folio);
>> +		folio_unlock(folio);
>> +		folio_end_writeback(folio);
>> +		return 0;
>> +	}
>> +	swap_zeromap_folio_clear(folio);
>>  	if (zswap_store(folio)) {
>>  		folio_start_writeback(folio);
>>  		folio_unlock(folio);
>> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool synchronous,
>>  		psi_memstall_enter(&pflags);
>>  	}
>>  	delayacct_swapin_start();
>> -
>> -	if (zswap_load(folio)) {
>> +	if (swap_zeromap_folio_test(folio)) {
>> +		folio_zero_fill(folio);
>> +		folio_mark_uptodate(folio);
>> +		folio_unlock(folio);
>> +	} else if (zswap_load(folio)) {
>>  		folio_mark_uptodate(folio);
>>  		folio_unlock(folio);
>>  	} else if (data_race(sis->flags & SWP_FS_OPS)) {
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index f1e559e216bd..90451174fe34 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -453,6 +453,8 @@ static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
>>  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>  		unsigned int idx)
>>  {
>> +	unsigned int i;
>> +
>>  	/*
>>  	 * If scan_swap_map_slots() can't find a free cluster, it will check
>>  	 * si->swap_map directly. To make sure the discarding cluster isn't
>> @@ -461,6 +463,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>  	 */
>>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> +	/*
>> +	 * zeromap can see updates from concurrent swap_writepage() and swap_read_folio()
>> +	 * call on other slots, hence use atomic clear_bit for zeromap instead of the
>> +	 * non-atomic bitmap_clear.
>> +	 */
>> +	for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> +		clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>
>>  	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
>>
>> @@ -482,7 +491,7 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>  {
>>  	struct swap_cluster_info *info, *ci;
>> -	unsigned int idx;
>> +	unsigned int idx, i;
>>
>>  	info = si->cluster_info;
>>
>> @@ -498,6 +507,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>>  			__free_cluster(si, idx);
>>  			memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>  					0, SWAPFILE_CLUSTER);
>> +			for (i = 0; i < SWAPFILE_CLUSTER; i++)
>> +				clear_bit(idx * SWAPFILE_CLUSTER + i, si->zeromap);
>>  			unlock_cluster(ci);
>>  		}
>>  	}
>> @@ -1336,6 +1347,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
>>  	count = p->swap_map[offset];
>>  	VM_BUG_ON(count != SWAP_HAS_CACHE);
>>  	p->swap_map[offset] = 0;
>> +	clear_bit(offset, p->zeromap);
>
> Hmm so clear_bit() is done at the swap_entry_free() point. I wonder if
> we can have a problem, where:
>
> 1. The swap entry has its zeromap bit set, and is freed to the swap
> slot cache (free_swap_slot() in mm/swap_slots.c). For instance, it is
> reclaimed from the swap cache, and all the processes referring to it
> are terminated, which decrements the swap count to 0 (swap_free() ->
> __swap_entry_free() -> free_swap_slots())
>
> 2. The swap slot is then re-used in swap space allocation
> (add_to_swap()) - its zeromap bit is never cleared.
>
> 3. swap_writepage() writes that non-zero page to swap

In swap_writepage(), with this patch you have:

    if (is_folio_zero_filled(folio)) {
        swap_zeromap_folio_set(folio);
        folio_unlock(folio);
        return 0;
    }
    swap_zeromap_folio_clear(folio);

i.e. if the folio is not zero-filled, swap_zeromap_folio_clear() is
called and the bit is cleared, so I think it would take care of this
scenario? swap_read_folio() will see the bit cleared in step 4.

> 4. swap_read_folio() checks the bitmap, sees that the zeromap bit for
> the entry is set, so populates a zero page for it.
>
> zswap in the past had to invalidate these leftover entries quite
> carefully. Chengming then moved the invalidation point to
> free_swap_slot(), massively simplifying the logic.
>
> I wonder if we need to do the same here?