From: Barry Song <21cnbao@gmail.com>
Date: Tue, 6 Aug 2024 14:38:13 +0800
Subject: Re: [PATCH] mm: attempt to batch free swap entries for zap_pte_range()
To: 20240806012409.61962-1-21cnbao@gmail.com
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kairui Song, Chris Li, "Huang, Ying", Hugh Dickins, Kalesh Singh, Ryan Roberts, David Hildenbrand
Next time, please use "> ", ">> ", etc. to quote when replying to emails.

On Tue, Aug 6, 2024 at 3:23 PM zhiguojiang wrote:
>
> From: Barry Song
>
> Zhiguo reported that swap release could be a serious bottleneck
> during process exits[1]. With mTHP, we have the opportunity to
> batch free swaps.
>
> Thanks to the work of Chris and Kairui[2], I was able to achieve
> this optimization with minimal code changes by building on their
> efforts.
>
> If swap_count is 1, which is likely true as most anon memory is
> private, we can free all contiguous swap slots together.
>
> Ran the below test program to measure the bandwidth of munmap
> using zRAM and 64KiB mTHP:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/time.h>
>
> unsigned long long tv_to_ms(struct timeval tv)
> {
>         return tv.tv_sec * 1000 + tv.tv_usec / 1000;
> }
>
> int main(void)
> {
>         struct timeval tv_b, tv_e;
> #define SIZE 1024*1024*1024
>         void *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>         if (p == MAP_FAILED) {
>                 perror("fail to get memory");
>                 exit(-1);
>         }
>
>         madvise(p, SIZE, MADV_HUGEPAGE);
>         memset(p, 0x11, SIZE); /* write to get mem */
>
>         madvise(p, SIZE, MADV_PAGEOUT);
>
>         gettimeofday(&tv_b, NULL);
>         munmap(p, SIZE);
>         gettimeofday(&tv_e, NULL);
>
>         printf("munmap bandwidth: %llu bytes/ms\n",
>                SIZE / (tv_to_ms(tv_e) - tv_to_ms(tv_b)));
> }
>
> The result is as below (munmap bandwidth):
>
>           mm-unstable    mm-unstable-with-patch
> round1    21053761       63161283
> round2    21053761       63161283
> round3    21053761       63161283
> round4    20648881       67108864
> round5    20648881       67108864
>
> munmap bandwidth becomes 3X faster.
>
> [1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
> [2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> Cc: Kairui Song
> Cc: Chris Li
> Cc: "Huang, Ying"
> Cc: Hugh Dickins
> Cc: Kalesh Singh
> Cc: Ryan Roberts
> Cc: David Hildenbrand
> Signed-off-by: Barry Song
> ---
>  mm/swapfile.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ea023fc25d08..ed872a186e81 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
>         return true;
>  }
>
> +static bool swap_is_last_map(struct swap_info_struct *si,
> +               unsigned long offset, int nr_pages,
> +               bool *has_cache)
> +{
> +       unsigned char *map = si->swap_map + offset;
> +       unsigned char *map_end = map + nr_pages;
> +       bool cached = false;
> +
> +       do {
> +               if ((*map & ~SWAP_HAS_CACHE) != 1)
> +                       return false;
> +               if (*map & SWAP_HAS_CACHE)
> +                       cached = true;
> +       } while (++map < map_end);
> +
> +       *has_cache = cached;
> +       return true;
> +}
> +
>  /*
>   * returns number of pages in the folio that backs the swap entry. If positive,
>   * the folio was reclaimed. If negative, the folio was not reclaimed.
>    If 0, no
> @@ -1469,6 +1488,39 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
>         return usage;
>  }
>
> +static bool try_batch_swap_entries_free(struct swap_info_struct *p,
> +               swp_entry_t entry, int nr, bool *any_only_cache)
> +{
> +       unsigned long offset = swp_offset(entry);
> +       struct swap_cluster_info *ci;
> +       bool has_cache = false;
> +       bool can_batch;
> +       int i;
> +
> +       /* cross into another cluster */
> +       if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
> +               return false;
>
> My understanding is that mTHP swap entries allocated by
> cluster_alloc_swap() belong to the same cluster in the same
> swap_info, so theoretically
> (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER) should never
> happen? Can you help confirm?

zap_pte_range() has no concept of folios (mTHP), as the folios could
already be gone. You could have this case:

folio1: last 16 slots of cluster1
folio2: first 16 slots of cluster2

folio1 and folio2 are within the same PMD and virtually contiguous
before they are unmapped. When both folio1 and folio2 are gone,
zap_pte_range()'s

        nr = swap_pte_batch(pte, max_nr, ptent);

will return nr = 32. "mTHP swap entries allocated by
cluster_alloc_swap() belong to the same cluster" is correct, but by
the time you zap_pte_range(), your mTHPs could already be gone.

>
> +       ci = lock_cluster_or_swap_info(p, offset);
> +       can_batch = swap_is_last_map(p, offset, nr, &has_cache);
> +       if (can_batch) {
> +               for (i = 0; i < nr; i++)
> +                       WRITE_ONCE(p->swap_map[offset + i], SWAP_HAS_CACHE);
> +       }
> +       unlock_cluster_or_swap_info(p, ci);
> +
> +       /* all swap_maps have count==1 and have no swapcache */
> +       if (!can_batch)
> +               goto out;
> +       if (!has_cache) {
> +               spin_lock(&p->lock);
> +               swap_entry_range_free(p, entry, nr);
> +               spin_unlock(&p->lock);
> +       }
> +       *any_only_cache = has_cache;
> +out:
> +       return can_batch;
> +}
> +
>  /*
>   * Drop the last HAS_CACHE flag of swap entries, caller have to
>   * ensure all entries belong to the same cgroup.
> @@ -1797,6 +1849,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>         bool any_only_cache = false;
>         unsigned long offset;
>         unsigned char count;
> +       bool batched;
>
>         if (non_swap_entry(entry))
>                 return;
> @@ -1808,6 +1861,13 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>         if (WARN_ON(end_offset > si->max))
>                 goto out;
>
> +       if (nr > 1 && swap_count(data_race(si->swap_map[start_offset])) == 1) {
> +               batched = try_batch_swap_entries_free(si, entry, nr,
> +                               &any_only_cache);
> +               if (batched)
> +                       goto reclaim;
> +       }
>
> The mTHP swap entries are batch freed as a whole directly, skipping
> the percpu swp_slots caches instead of freeing every swap entry
> separately, which can accelerate the release of mTHP swap entries.
> I think it is valuable.

Yes. I have seen a 3X performance improvement.

>
> +
>         /*
>          * First free all entries in the range.
>          */
> @@ -1821,6 +1881,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
>                 }
>         }
>
> +reclaim:
>         /*
>          * Short-circuit the below loop if none of the entries had their
>          * reference drop to zero.
> --
> 2.34.1
>
> Thanks
> Zhiguo

Thanks
Barry