From: Barry Song <21cnbao@gmail.com>
Date: Mon, 13 May 2024 19:30:01 +1200
Subject: Re: [PATCH v7 5/7] mm: swap: Allow storage of all mTHP orders
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Gao Xiang,
 Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Chris Li, Lance Yang,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <20240408183946.2991168-6-ryan.roberts@arm.com>
References: <20240408183946.2991168-1-ryan.roberts@arm.com>
 <20240408183946.2991168-6-ryan.roberts@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Tue, Apr 9, 2024 at 6:40 AM Ryan Roberts wrote:
>
> Multi-size THP enables performance improvements by allocating large,
> pte-mapped folios for anonymous memory. However, I've observed that on an
> arm64 system running a parallel workload (e.g. kernel compilation)
> across many cores, under high memory pressure, the speed regresses. This
> is due to bottlenecking on the increased number of TLBIs added due to
> all the extra folio splitting when the large folios are swapped out.
>
> Therefore, solve this regression by adding support for swapping out mTHP
> without needing to split the folio, just as is already done for
> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
> and when the swap backing store is a non-rotating block device. These
> are the same constraints as for the existing PMD-sized THP swap-out
> support.
>
> Note that no attempt is made to swap-in (m)THP here - this is still done
> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
> prerequisite for swapping-in mTHP.
>
> The main change here is to improve the swap entry allocator so that it
> can allocate any power-of-2 number of contiguous entries between [1, (1
> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> order and allocating sequentially from it until the cluster is full.
> This ensures that we don't need to search the map and we get no
> fragmentation due to alignment padding for different orders in the
> cluster. If there is no current cluster for a given order, we attempt to
> allocate a free cluster from the list. If there are no free clusters, we
> fail the allocation and the caller can fall back to splitting the folio
> and allocating individual entries (as per the existing PMD-sized THP
> fallback).
>
> The per-order current clusters are maintained per-cpu using the existing
> infrastructure. This is done to avoid interleaving pages from different
> tasks, which would prevent IO from being batched. This is already done
> for the order-0 allocations, so we follow the same pattern.
>
> As is done for order-0 per-cpu clusters, the scanner can now steal
> order-0 entries from any per-cpu-per-order reserved cluster. This
> ensures that when the swap file is getting full, space doesn't get tied
> up in the per-cpu reserves.
>
> This change only modifies swap to be able to accept any order of mTHP. It
> doesn't change the callers to elide doing the actual split. That will be
> done in separate changes.
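[Aside: the per-order packing described above can be pictured with a minimal
user-space sketch. This is an illustration only, not the kernel code:
SWAPFILE_CLUSTER is simply hard-coded to 512, and alloc_entries() is a
made-up helper standing in for the per-order, per-cpu allocation path.]

/*
 * Minimal sketch of the idea above: one cluster per order, allocated
 * sequentially, nr_pages slots at a time. Because 512 is a multiple of
 * every power of two up to 2^9, same-order allocations fill the cluster
 * exactly, with no alignment padding.
 */
#include <stdio.h>
#include <string.h>

#define SWAPFILE_CLUSTER 512                    /* slots per cluster */

static char cluster_map[SWAPFILE_CLUSTER];      /* 0 = free, 1 = used */
static unsigned int next_free;                  /* analogue of cluster->next[order] */

/* Carve the next naturally aligned run of 1 << order slots out of the cluster. */
static int alloc_entries(int order)
{
        unsigned int nr_pages = 1u << order;

        if (next_free + nr_pages > SWAPFILE_CLUSTER)
                return -1;                      /* cluster exhausted */
        memset(&cluster_map[next_free], 1, nr_pages);
        next_free += nr_pages;
        return 0;
}

int main(void)
{
        int order = 4;                          /* e.g. 64KiB folios with 4KiB pages */
        unsigned int n = 0;

        while (alloc_entries(order) == 0)
                n++;

        /* Prints 32 allocations and 512/512 slots used: no padding. */
        printf("order-%d allocations per cluster: %u, used slots: %u/%u\n",
               order, n, next_free, (unsigned int)SWAPFILE_CLUSTER);
        return 0;
}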
>
> Reviewed-by: "Huang, Ying"
> Signed-off-by: Ryan Roberts
> ---
>  include/linux/swap.h |   8 ++-
>  mm/swapfile.c        | 162 ++++++++++++++++++++++++-------------------
>  2 files changed, 98 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index b888e1080a94..11c53692f65f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -268,13 +268,19 @@ struct swap_cluster_info {
>   */
>  #define SWAP_NEXT_INVALID       0
>
> +#ifdef CONFIG_THP_SWAP
> +#define SWAP_NR_ORDERS          (PMD_ORDER + 1)
> +#else
> +#define SWAP_NR_ORDERS          1
> +#endif
> +
>  /*
>   * We assign a cluster to each CPU, so each CPU can allocate swap entry from
>   * its own cluster and swapout sequentially. The purpose is to optimize swapout
>   * throughput.
>   */
>  struct percpu_cluster {
> -        unsigned int next; /* Likely next allocation offset */
> +        unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>
>  struct swap_cluster_list {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d2e3d3cd439f..148ef08f19dd 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
>
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will be
> - * removed from free cluster list and its usage counter will be increased.
> + * removed from free cluster list and its usage counter will be increased by
> + * count.
>   */
> -static void inc_cluster_info_page(struct swap_info_struct *p,
> -        struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void add_cluster_info_page(struct swap_info_struct *p,
> +        struct swap_cluster_info *cluster_info, unsigned long page_nr,
> +        unsigned long count)
>  {
>          unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>
> @@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
>          if (cluster_is_free(&cluster_info[idx]))
>                  alloc_cluster(p, idx);
>
> -        VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
> +        VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
>          cluster_set_count(&cluster_info[idx],
> -                cluster_count(&cluster_info[idx]) + 1);
> +                cluster_count(&cluster_info[idx]) + count);
> +}
> +
> +/*
> + * The cluster corresponding to page_nr will be used. The cluster will be
> + * removed from free cluster list and its usage counter will be increased by 1.
> + */
> +static void inc_cluster_info_page(struct swap_info_struct *p,
> +        struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +{
> +        add_cluster_info_page(p, cluster_info, page_nr, 1);
>  }
>
>  /*
> @@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
>   */
>  static bool
>  scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> -        unsigned long offset)
> +        unsigned long offset, int order)
>  {
>          struct percpu_cluster *percpu_cluster;
>          bool conflict;
> @@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>                  return false;
>
>          percpu_cluster = this_cpu_ptr(si->percpu_cluster);
> -        percpu_cluster->next = SWAP_NEXT_INVALID;
> +        percpu_cluster->next[order] = SWAP_NEXT_INVALID;
> +        return true;
> +}
> +
> +static inline bool swap_range_empty(char *swap_map, unsigned int start,
> +                                    unsigned int nr_pages)
> +{
> +        unsigned int i;
> +
> +        for (i = 0; i < nr_pages; i++) {
> +                if (swap_map[start + i])
> +                        return false;
> +        }
> +
>          return true;
>  }
>
>  /*
> - * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
> - * might involve allocating a new cluster for current CPU too.
> + * Try to get swap entries with specified order from current cpu's swap entry
> + * pool (a cluster). This might involve allocating a new cluster for current CPU
> + * too.
>   */
>  static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> -        unsigned long *offset, unsigned long *scan_base)
> +        unsigned long *offset, unsigned long *scan_base, int order)
>  {
> +        unsigned int nr_pages = 1 << order;
>          struct percpu_cluster *cluster;
>          struct swap_cluster_info *ci;
>          unsigned int tmp, max;
>
>  new_cluster:
>          cluster = this_cpu_ptr(si->percpu_cluster);
> -        tmp = cluster->next;
> +        tmp = cluster->next[order];
>          if (tmp == SWAP_NEXT_INVALID) {
>                  if (!cluster_list_empty(&si->free_clusters)) {
>                          tmp = cluster_next(&si->free_clusters.head) *
> @@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>
>          /*
>           * Other CPUs can use our cluster if they can't find a free cluster,
> -         * check if there is still free entry in the cluster
> +         * check if there is still free entry in the cluster, maintaining
> +         * natural alignment.
>           */
>          max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
>          if (tmp < max) {
>                  ci = lock_cluster(si, tmp);
>                  while (tmp < max) {
> -                        if (!si->swap_map[tmp])
> +                        if (swap_range_empty(si->swap_map, tmp, nr_pages))
>                                  break;
> -                        tmp++;
> +                        tmp += nr_pages;
>                  }
>                  unlock_cluster(ci);
>          }
>          if (tmp >= max) {
> -                cluster->next = SWAP_NEXT_INVALID;
> +                cluster->next[order] = SWAP_NEXT_INVALID;
>                  goto new_cluster;
>          }
>          *offset = tmp;
>          *scan_base = tmp;
> -        tmp += 1;
> -        cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
> +        tmp += nr_pages;
> +        cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
>          return true;
>  }
>
> @@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
>
>  static int scan_swap_map_slots(struct swap_info_struct *si,
>                                 unsigned char usage, int nr,
> -                               swp_entry_t slots[])
> +                               swp_entry_t slots[], int order)
>  {
>          struct swap_cluster_info *ci;
>          unsigned long offset;
>          unsigned long scan_base;
>          unsigned long last_in_cluster = 0;
>          int latency_ration = LATENCY_LIMIT;
> +        unsigned int nr_pages = 1 << order;
>          int n_ret = 0;
>          bool scanned_many = false;
>
> @@ -817,6 +846,25 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>           * And we let swap pages go all over an SSD partition. Hugh
>           */
>
> +        if (order > 0) {
> +                /*
> +                 * Should not even be attempting large allocations when huge
> +                 * page swap is disabled. Warn and fail the allocation.
> +                 */
> +                if (!IS_ENABLED(CONFIG_THP_SWAP) ||
> +                    nr_pages > SWAPFILE_CLUSTER) {
> +                        VM_WARN_ON_ONCE(1);
> +                        return 0;
> +                }
> +
> +                /*
> +                 * Swapfile is not block device or not using clusters so unable
> +                 * to allocate large entries.
> +                 */
> +                if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> +                        return 0;
> +        }
> +
>          si->flags += SWP_SCANNING;
>          /*
>           * Use percpu scan base for SSD to reduce lock contention on
> @@ -831,8 +879,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>
>          /* SSD algorithm */
>          if (si->cluster_info) {
> -                if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
> +                if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {

Hi Ryan,

Sorry for bringing up an old thread. During the first hour of using an
Android phone with 64KiB mTHP, we saw an anon_swpout_fallback rate below
10%. After several hours of use, however, the anon_swpout_fallback rate
rose sharply, eventually reaching 100%.

Looking at the code of scan_swap_map_try_ssd_cluster():

static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
        unsigned long *offset, unsigned long *scan_base, int order)
{
        unsigned int nr_pages = 1 << order;
        struct percpu_cluster *cluster;
        struct swap_cluster_info *ci;
        unsigned int tmp, max;

new_cluster:
        cluster = this_cpu_ptr(si->percpu_cluster);
        tmp = cluster->next[order];
        if (tmp == SWAP_NEXT_INVALID) {
                if (!cluster_list_empty(&si->free_clusters)) {
                        tmp = cluster_next(&si->free_clusters.head) *
                                        SWAPFILE_CLUSTER;
                } else if (!cluster_list_empty(&si->discard_clusters)) {
                        /*
                         * we don't have free cluster but have some clusters in
                         * discarding, do discard now and reclaim them, then
                         * reread cluster_next_cpu since we dropped si->lock
                         */
                        swap_do_scheduled_discard(si);
                        *scan_base = this_cpu_read(*si->cluster_next_cpu);
                        *offset = *scan_base;
                        goto new_cluster;
                } else
                        return false;
        }
        ...
}

Considering the cluster_list_empty() checks, is a completely free cluster
really required to get a contiguous run of swap slots for a large folio
swap-out? For instance, if many partially used clusters still have ample
free swap slots, could we be missing them simply because no slow scan is
ever run over them for order > 0? (A toy sketch of the situation I have in
mind follows below my sign-off.)

I'm not saying your patchset has problems, just that I have some
questions.

Thanks
Barry
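P.S. Below is a toy user-space model of the ageing behaviour described
above. It is a minimal sketch only: the cluster count and fill levels are
invented, the helpers are made up, and the real allocator can of course
keep consuming its current per-cpu cluster before it needs a new one from
the free list. The point it illustrates is that once no cluster on the
free list is completely empty, an order > 0 request in this model fails
even though plenty of free slots remain scattered across partially used
clusters, so the caller falls back to splitting the folio.

/*
 * Toy model: a well-aged swapfile where every cluster is half free but
 * none is completely free. An mTHP-sized request that insists on taking
 * a fully free cluster always fails here, despite ample free space.
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512            /* slots per cluster */
#define NR_CLUSTERS      64             /* invented swapfile size */

static unsigned int free_slots[NR_CLUSTERS];    /* free slots per cluster */

/* Analogue of "!cluster_list_empty(&si->free_clusters)" in this toy model. */
static bool have_fully_free_cluster(void)
{
        for (int i = 0; i < NR_CLUSTERS; i++)
                if (free_slots[i] == SWAPFILE_CLUSTER)
                        return true;
        return false;
}

int main(void)
{
        unsigned int total_free = 0;

        /* Age the swapfile: every cluster is exactly half free. */
        for (int i = 0; i < NR_CLUSTERS; i++) {
                free_slots[i] = SWAPFILE_CLUSTER / 2;
                total_free += free_slots[i];
        }

        /* An order-4 (64KiB) request needs only 16 contiguous slots. */
        printf("total free slots: %u, fully free cluster available: %s\n",
               total_free, have_fully_free_cluster() ? "yes" : "no");
        return 0;
}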