From: Yosry Ahmed
Date: Thu, 30 May 2024 09:24:20 -0700
Subject: Re: [PATCH 1/2] mm: store zero pages to be swapped out in a bitmap
To: Johannes Weiner
Cc: Usama Arif, akpm@linux-foundation.org, nphamcs@gmail.com, chengming.zhou@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Hugh Dickins, Huang Ying
References: <20240530102126.357438-1-usamaarif642@gmail.com> <20240530102126.357438-2-usamaarif642@gmail.com> <20240530122715.GB1222079@cmpxchg.org>
In-Reply-To: <20240530122715.GB1222079@cmpxchg.org>

On Thu, May 30, 2024 at 5:27 AM Johannes Weiner wrote:
>
> On Thu, May 30, 2024 at 11:19:07AM +0100, Usama Arif wrote:
> > Approximately 10-20% of pages to be swapped out are zero pages [1].
> > Rather than reading/writing these pages to flash resulting
> > in increased I/O and flash wear, a bitmap can be used to mark these
> > pages as zero at write time, and the pages can be filled at
> > read time if the bit corresponding to the page is set.
> > With this patch, NVMe writes in Meta server fleet decreased
> > by almost 10% with conventional swap setup (zswap disabled).
> >
> > [1]https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
> >
> > Signed-off-by: Usama Arif
>
> This is awesome.
>
> > ---
> >  include/linux/swap.h |  1 +
> >  mm/page_io.c         | 86 ++++++++++++++++++++++++++++++++++++++++++--
> >  mm/swapfile.c        | 10 ++++++
> >  3 files changed, 95 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index a11c75e897ec..e88563978441 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -299,6 +299,7 @@ struct swap_info_struct {
> >       signed char     type;           /* strange name for an index */
> >       unsigned int    max;            /* extent of the swap_map */
> >       unsigned char *swap_map;        /* vmalloc'ed array of usage counts */
> > +     unsigned long *zeromap;         /* vmalloc'ed bitmap to track zero pages */
>
> One bit per swap slot, so 1 / (4096 * 8) = 0.003% static memory
> overhead for configured swap space. That seems reasonable for what
> appears to be a fairly universal 10% reduction in swap IO.
>
> An alternative implementation would be to reserve a bit in
> swap_map. This would be no overhead at idle, but would force
> continuation counts earlier on heavily shared page tables, and AFAICS
> would get complicated in terms of locking, whereas this one is pretty
> simple (atomic ops protect the map, swapcache lock protects the bit).
>
> So I prefer this version. But a few comments below:

I am wondering if it's even possible to take this one step further and
avoid reclaiming zero-filled pages in the first place.
Can we just unmap them and let the first read fault allocate a zero'd
page like uninitialized memory, or point them at the zero page and
make them read-only, or something? Then we could free them directly
without going into the swap code to begin with.

That's how I thought about it initially when I attempted to support
only zero-filled pages in zswap. It could be a more complex
implementation though.

[..]
> > +
> > +static void swap_zeromap_folio_set(struct folio *folio)
> > +{
> > +     struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > +     swp_entry_t entry;
> > +     unsigned int i;
> > +
> > +     for (i = 0; i < folio_nr_pages(folio); i++) {
> > +             entry = page_swap_entry(folio_page(folio, i));
> > +             bitmap_set(sis->zeromap, swp_offset(entry), 1);
>
> This should be set_bit(). bitmap_set() isn't atomic, so it would
> corrupt the map on concurrent swapping of other zero pages. And you
> don't need a range op here anyway.

It's a shame there is no range version of set_bit(). I suspect we can
save a few atomic operations on large folios if we write them in
chunks rather than one by one.
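
To illustrate the chunked idea, here is a minimal userspace sketch. It
assumes C11 atomics as a stand-in for the kernel's atomic bitops, and
zeromap_set_range() is a hypothetical name, not an existing kernel
helper. It issues one atomic OR per word touched rather than one atomic
op per bit. Each word is updated atomically, but the range as a whole
is not.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/*
 * Hypothetical helper (not a kernel API): set bits [start, start + nr)
 * in map using one atomic OR per word instead of one atomic op per bit.
 */
static void zeromap_set_range(_Atomic unsigned long *map, size_t start,
			      size_t nr)
{
	size_t end = start + nr;

	while (start < end) {
		size_t word = start / BITS_PER_WORD;
		size_t first = start % BITS_PER_WORD;
		size_t count = BITS_PER_WORD - first;
		unsigned long mask;

		/* Clamp the chunk to what is left of the range. */
		if (count > end - start)
			count = end - start;

		/* Mask of 'count' bits starting at bit 'first'. */
		if (count == BITS_PER_WORD)
			mask = ~0UL;
		else
			mask = ((1UL << count) - 1) << first;

		atomic_fetch_or(&map[word], mask);
		start += count;
	}
}
```

For a folio whose swap offsets are contiguous, a single call such as
zeromap_set_range(map, first_offset, nr_pages) would replace nr_pages
individual atomic sets with at most nr_pages / BITS_PER_WORD + 1 atomic
ORs.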