From: Kairui Song <ryncsn@gmail.com>
Date: Wed, 23 Jul 2025 01:44:49 +0800
Subject: Re: [PATCH 4/4] mm: swap: Per-cgroup per-CPU swap device cache with shared clusters
To: Youngjun Park
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, gunho.lee@lge.com, iamjoonsoo.kim@lge.com, taejoon.song@lge.com
In-Reply-To: <20250716202006.3640584-5-youngjun.park@lge.com>
References: <20250716202006.3640584-1-youngjun.park@lge.com> <20250716202006.3640584-5-youngjun.park@lge.com>
On Thu, Jul 17, 2025 at 4:21 AM Youngjun Park wrote:
>
> This patch introduces a new swap allocation mechanism that supports
> per-cgroup per-CPU swap device caches, combined with per-device per-CPU
> cluster management.
>
> The existing global swap allocator uses a per-CPU device cache and
> cluster, shared by all cgroups. Under this model, per-cgroup swap
> priorities cannot be effectively honored on the fast path, as allocations
> do not distinguish between cgroups.
>
> To address this, we introduce per-cgroup per-CPU swap device caches.
> This allows fast-path swap allocations to respect each cgroup's
> individual priority settings.
>
> To avoid an explosion of cluster structures proportional to the number
> of cgroups, clusters remain per-device and are shared across cgroups.
> This strikes a balance between performance and memory overhead.
>
> Suggested-by: Nhat Pham
> Suggested-by: Kairui Song
> Signed-off-by: Youngjun Park
> ---
>  include/linux/swap.h      |   7 ++
>  mm/swap_cgroup_priority.c | 156 +++++++++++++++++++++++++++++++++++++-
>  mm/swap_cgroup_priority.h |  39 ++++++++++
>  mm/swapfile.c             |  47 +++++++-----
>  4 files changed, 228 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index bfddbec2ee28..ab15f4c103a1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -283,6 +283,12 @@ enum swap_cluster_flags {
>  #define SWAP_NR_ORDERS 1
>  #endif
>
> +#ifdef CONFIG_SWAP_CGROUP_PRIORITY
> +struct percpu_cluster {
> +        unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> +};
> +#endif
> +
>  /*
>   * We keep using same cluster for rotational device so IO will be sequential.
>   * The purpose is to optimize SWAP throughput on these device.
> @@ -341,6 +347,7 @@ struct swap_info_struct {
>          struct list_head discard_clusters; /* discard clusters list */
>  #ifdef CONFIG_SWAP_CGROUP_PRIORITY
>          u64 id;
> +        struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
>  #endif
>          struct plist_node avail_lists[]; /*
>                                             * entries in swap_avail_heads, one
> diff --git a/mm/swap_cgroup_priority.c b/mm/swap_cgroup_priority.c
> index 84e876b77f01..f960c3dcab48 100644
> --- a/mm/swap_cgroup_priority.c
> +++ b/mm/swap_cgroup_priority.c
> @@ -21,6 +21,17 @@
>  #include "swap_cgroup_priority.h"
>  #include "memcontrol-v1.h"
>
> +/*
> + * We do maintain a cache on a per-cgroup-per-swap-device basis.
> + * However, the underlying cluster cache itself is managed
> + * per-swap-device. This design prevents each individual
> + * swap_cgroup_priority entry from caching its own cluster data,
> + * even as the number of such entries increases.
> + */
> +struct percpu_swap_device {
> +        struct swap_info_struct *si[SWAP_NR_ORDERS];
> +};
> +
>  static DEFINE_MUTEX(swap_cgroup_priority_inherit_lck);
>  static LIST_HEAD(swap_cgroup_priority_list);
>
> @@ -49,6 +60,7 @@ static LIST_HEAD(swap_cgroup_priority_list);
>   * least_priority - Current lowest priority.
>   * distance       - Priority differences from global swap priority.
>   * default_prio   - Default priority for this cgroup.
> + * pcpu_swapdev   - Per-CPU swap device.
>   * plist          - Priority list head.
>   */
>  struct swap_cgroup_priority {
> @@ -64,6 +76,7 @@ struct swap_cgroup_priority {
>          int least_priority;
>          s8 distance;
>          int default_prio;
> +        struct percpu_swap_device __percpu *pcpu_swapdev;
>          struct plist_head plist[];
>  };
>
> @@ -132,6 +145,21 @@ static struct swap_cgroup_priority *get_effective_swap_cgroup_priority(
>          return swap_priority->effective;
>  }
>
> +static struct swap_cgroup_priority *get_effective_swap_cgroup_priority_rcu(
> +        struct mem_cgroup *memcg)
> +{
> +        struct swap_cgroup_priority *swap_priority;
> +
> +        if (!memcg)
> +                return NULL;
> +
> +        swap_priority = rcu_dereference(memcg->swap_priority);
> +        if (!swap_priority)
> +                return NULL;
> +
> +        return rcu_dereference(swap_priority->effective);
> +}
> +
>  static bool validate_effective_swap_cgroup_priority(
>          struct mem_cgroup *memcg,
>          struct swap_cgroup_priority **swap_priority)
> @@ -172,6 +200,9 @@ static void free_swap_cgroup_priority_pnode(
>  static void free_swap_cgroup_priority(
>          struct swap_cgroup_priority *swap_priority)
>  {
> +        if (swap_priority->pcpu_swapdev)
> +                free_percpu(swap_priority->pcpu_swapdev);
> +
>          for (int i = 0; i < MAX_SWAPFILES; i++)
>                  free_swap_cgroup_priority_pnode(swap_priority->pnode[i]);
>
> @@ -187,6 +218,12 @@ static struct swap_cgroup_priority *alloc_swap_cgroup_priority(void)
>          if (!swap_priority)
>                  return NULL;
>
> +        swap_priority->pcpu_swapdev = alloc_percpu(struct percpu_swap_device);
> +        if (!swap_priority->pcpu_swapdev) {
> +                kvfree(swap_priority);
> +                return NULL;
> +        }
> +
>          /*
>           * Pre-allocates pnode array up to nr_swapfiles at init.
>           * Individual pnodes are assigned on swapon, but not freed
> @@ -326,10 +363,34 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
>          unsigned long offset;
>          int node;
>
> -        /*
> -         * TODO: Per-cpu swap cluster cache can't be used directly
> -         * as cgroup-specific priorities may select different devices.
> -         */
> +        rcu_read_lock();
> +        if (!(swap_priority = get_effective_swap_cgroup_priority_rcu(memcg))) {
> +                rcu_read_unlock();
> +                return false;
> +        }
> +
> +        /* Fast path */
> +        si = this_cpu_read(swap_priority->pcpu_swapdev->si[order]);
> +        if (si && get_swap_device_info(si)) {
> +                offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> +                if (offset) {
> +                        *entry = swp_entry(si->type, offset);
> +                        /*
> +                         * Protected by 'percpu_swap_cluster' local_lock;
> +                         * CPU migration is disabled during this operation.
> +                         */
> +                        this_cpu_write(swap_priority->pcpu_swapdev->si[order],
> +                                       si);
> +                        put_swap_device(si);
> +                        rcu_read_unlock();
> +
> +                        return true;
> +                }
> +                put_swap_device(si);
> +        }
> +        rcu_read_unlock();
> +
> +        /* Slow path */

Hi Youngjun,

One thing I noticed after a quick glance is that swap_alloc_cgroup_priority is bloated and does much the same thing as folio_alloc_swap.

I imagine we could just have a struct (e.g. let's call it struct swap_percpu_info / pi) as a closure of what the allocator needs; it would contain the plist and the fast-path device. With slight changes to folio_alloc_swap, it could respect either the cgroup's pi or the global pi. (pi might be a horrible name though, feel free to change it.)

For example, the first thing swap_alloc_fast would do is:
`struct swap_percpu_info *pi = folio_swap_percpu_info(folio);`
where folio_swap_percpu_info returns the cgroup's swap_percpu_info or the global one.

swap_alloc_slow can do a similar thing; it can then just use pi->plist and pi->pcpu_swapdev (cluster info will be in si), ignoring all the cgroup differences. A rough sketch of this idea is further below.

Also, it is better to run your patches through ./scripts/checkpatch.pl; I'm seeing some style issues.

I'll check your other patches later this week too. Thanks for the update on this idea.

>          spin_lock(&swap_avail_lock);
>          node = numa_node_id();
>
> @@ -350,6 +411,14 @@ bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
>                  if (get_swap_device_info(si)) {
>                          offset = cluster_alloc_swap_entry(si, order,
>                                                            SWAP_HAS_CACHE);
> +                        /*
> +                         * Protected by 'percpu_swap_cluster' local_lock;
> +                         * CPU migration is disabled during this operation.
> +                         */
> +                        if (memcg->swap_priority == swap_priority)
> +                                this_cpu_write(
> +                                        swap_priority->pcpu_swapdev->si[order],
> +                                        si);
>                          put_swap_device(si);
>                          if (offset) {
>                                  *entry = swp_entry(si->type, offset);
> @@ -687,6 +756,21 @@ static int __apply_swap_cgroup_priority(
>          return 0;
>  }
>
> +static int init_swap_cgroup_priority_pcpu_swapdev_cache(
> +        struct swap_cgroup_priority *swap_priority)
> +{
> +        int cpu;
> +
> +        for_each_possible_cpu(cpu) {
> +                struct percpu_swap_device *pcp_swap_dev =
> +                        per_cpu_ptr(swap_priority->pcpu_swapdev, cpu);
> +                for (int i = 0; i < SWAP_NR_ORDERS; i++)
> +                        pcp_swap_dev->si[i] = NULL;
> +        }
> +
> +        return 0;
> +}
> +
>  /*
>   * If this is the top-level swap_cgroup_priority, propagation is needed.
>   * We traverse the 'mem_cgroup_tree' using 'for_each_mem_cgroup_tree'.
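Coming back to the swap_percpu_info idea above, here is a very rough, untested sketch of the shape I have in mind. Note that folio_swap_percpu_info, the pi / avail_list members and global_swap_pi are placeholder names I'm making up for illustration, not existing code:

        struct swap_percpu_info {
                struct plist_head *avail_list;                    /* plist to scan on the slow path */
                struct percpu_swap_device __percpu *pcpu_swapdev; /* fast-path per-CPU si cache */
        };

        /* Return the cgroup's info when it has a priority set, otherwise the global one. */
        static struct swap_percpu_info *folio_swap_percpu_info(struct folio *folio)
        {
                struct mem_cgroup *memcg = folio_memcg(folio);
                struct swap_cgroup_priority *prio;

                /* Caller holds the RCU read lock, as swap_alloc_cgroup_priority() does today. */
                prio = memcg ? rcu_dereference(memcg->swap_priority) : NULL;
                return prio ? &prio->pi : &global_swap_pi;
        }

Then the fast path only needs something like:

        struct swap_percpu_info *pi = folio_swap_percpu_info(folio);
        struct swap_info_struct *si = this_cpu_read(pi->pcpu_swapdev->si[order]);

        if (si && get_swap_device_info(si)) {
                /* ... same cluster_alloc_swap_entry() path as in this patch ... */
        }

and swap_alloc_slow can walk pi->avail_list the same way whether or not a cgroup priority is set.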
> @@ -795,6 +879,8 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
>          for_each_node(nid)
>                  plist_head_init(&swap_priority->plist[nid]);
>
> +        init_swap_cgroup_priority_pcpu_swapdev_cache(swap_priority);
> +
>  prio_set:
>          spin_lock(&swap_lock);
>          spin_lock(&swap_avail_lock);
> @@ -843,6 +929,23 @@ int apply_swap_cgroup_priority(struct mem_cgroup *memcg, u64 id, int prio)
>
>          spin_unlock(&swap_avail_lock);
>          spin_unlock(&swap_lock);
> +        /*
> +         * XXX: We cannot fully synchronize with swap_alloc_cgroup_priority
> +         * when updating the next si.
> +         * Still, we ensure that flush operations inside swap_priority
> +         * are performed as reliably as possible.
> +         */
> +        if (id != DEFAULT_ID &&
> +            swap_priority == swap_priority->effective && !new) {
> +                int cpu;
> +                struct swap_info_struct **pcp_si;
> +                for_each_possible_cpu(cpu) {
> +                        pcp_si = per_cpu_ptr(
> +                                swap_priority->pcpu_swapdev->si, cpu);
> +                        for (int i = 0; i < SWAP_NR_ORDERS; i++)
> +                                pcp_si[i] = NULL;
> +                }
> +        }
>          mutex_unlock(&swap_cgroup_priority_inherit_lck);
>          return 0;
>
> @@ -886,3 +989,48 @@ void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
>          spin_unlock(&swap_avail_lock);
>          mutex_unlock(&swap_cgroup_priority_inherit_lck);
>  }
> +
> +void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si)
> +{
> +        int cpu, i;
> +        struct swap_info_struct **pcp_si;
> +        struct swap_cgroup_priority *swap_priority;
> +
> +        rcu_read_lock();
> +        list_for_each_entry_rcu(swap_priority,
> +                                &swap_cgroup_priority_list, link) {
> +                for_each_possible_cpu(cpu) {
> +                        pcp_si = per_cpu_ptr(
> +                                swap_priority->pcpu_swapdev->si, cpu);
> +
> +                        for (i = 0; i < SWAP_NR_ORDERS; i++)
> +                                cmpxchg(&pcp_si[i], si, NULL);
> +                }
> +        }
> +        rcu_read_unlock();
> +}
> +
> +bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +        si->percpu_cluster = alloc_percpu(struct percpu_cluster);
> +        if (!si->percpu_cluster)
> +                return false;
> +
> +        int cpu;
> +        int i;
> +        for_each_possible_cpu(cpu) {
> +                struct percpu_cluster *cluster;
> +
> +                cluster = per_cpu_ptr(si->percpu_cluster, cpu);
> +                for (i = 0; i < SWAP_NR_ORDERS; i++)
> +                        cluster->next[i] = SWAP_ENTRY_INVALID;
> +        }
> +
> +        return true;
> +}
> +
> +void free_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +        free_percpu(si->percpu_cluster);
> +        si->percpu_cluster = NULL;
> +}
> diff --git a/mm/swap_cgroup_priority.h b/mm/swap_cgroup_priority.h
> index 5d16b63d12e0..815822ebd0d1 100644
> --- a/mm/swap_cgroup_priority.h
> +++ b/mm/swap_cgroup_priority.h
> @@ -47,6 +47,22 @@ struct swap_cgroup_priority *inherit_swap_cgroup_priority(
>  bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg, swp_entry_t *entry,
>                                  int order);
>  void delete_swap_cgroup_priority(struct mem_cgroup *memcg);
> +void flush_swap_cgroup_priority_percpu_swapdev(struct swap_info_struct *si);
> +
> +bool alloc_percpu_swap_cluster(struct swap_info_struct *si);
> +void free_percpu_swap_cluster(struct swap_info_struct *si);
> +static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
> +                                                  int order,
> +                                                  unsigned int next)
> +{
> +        this_cpu_write(si->percpu_cluster->next[order], next);
> +}
> +
> +static inline unsigned int read_percpu_swap_cluster_next(
> +        struct swap_info_struct *si, int order)
> +{
> +        return __this_cpu_read(si->percpu_cluster->next[order]);
> +}
>  #else
>  int swap_node(struct swap_info_struct *si);
>  unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> @@ -85,5 +101,28 @@ static inline bool swap_alloc_cgroup_priority(struct mem_cgroup *memcg,
>  static inline void delete_swap_cgroup_priority(struct mem_cgroup *memcg)
>  {
>  }
> +static inline void flush_swap_cgroup_priority_percpu_swapdev(
> +        struct swap_info_struct *si)
> +{
> +}
> +static inline bool alloc_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +        return true;
> +}
> +static inline void free_percpu_swap_cluster(struct swap_info_struct *si)
> +{
> +}
> +static inline void write_percpu_swap_cluster_next(struct swap_info_struct *si,
> +                                                  int order,
> +                                                  unsigned int next)
> +{
> +        return;
> +}
> +
> +static inline unsigned int read_percpu_swap_cluster_next(
> +        struct swap_info_struct *si, int order)
> +{
> +        return SWAP_ENTRY_INVALID;
> +}
>  #endif
>  #endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index bfd0532ad250..6a5ac9962e9f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -817,12 +817,15 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>  out:
>          relocate_cluster(si, ci);
>          unlock_cluster(ci);
> +
>          if (si->flags & SWP_SOLIDSTATE) {
>                  this_cpu_write(percpu_swap_cluster.offset[order], next);

Why not just remove `percpu_swap_cluster.offset` and share si->percpu_cluster among all cgroups (including the root cgroup)? Otherwise, if e.g. the root cgroup's pcpu cluster and another cgroup's pcpu cluster point at the same cluster, they may contend on allocations of different orders; even for the same order, the performance might not be good, as multiple CPUs will race with each other. It would also be easier to implement (rough sketch at the end of this mail).

>                  this_cpu_write(percpu_swap_cluster.si[order], si);
> +                write_percpu_swap_cluster_next(si, order, next);
>          } else {
>                  si->global_cluster->next[order] = next;
>          }
> +
>          return found;
>  }
>
> @@ -892,26 +895,29 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>          if (order && !(si->flags & SWP_BLKDEV))
>                  return 0;
>
> -        if (!(si->flags & SWP_SOLIDSTATE)) {
> +        if (si->flags & SWP_SOLIDSTATE) {
> +                offset = read_percpu_swap_cluster_next(si, order);
> +        } else {
>                  /* Serialize HDD SWAP allocation for each device. */
>                  spin_lock(&si->global_cluster_lock);
>                  offset = si->global_cluster->next[order];
> -                if (offset == SWAP_ENTRY_INVALID)
> -                        goto new_cluster;
> +        }
>
> -                ci = lock_cluster(si, offset);
> -                /* Cluster could have been used by another order */
> -                if (cluster_is_usable(ci, order)) {
> -                        if (cluster_is_empty(ci))
> -                                offset = cluster_offset(si, ci);
> -                        found = alloc_swap_scan_cluster(si, ci, offset,
> -                                                        order, usage);
> -                } else {
> -                        unlock_cluster(ci);
> -                }
> -                if (found)
> -                        goto done;
> +        if (offset == SWAP_ENTRY_INVALID)
> +                goto new_cluster;
> +
> +        ci = lock_cluster(si, offset);
> +        /* Cluster could have been used by another order */
> +        if (cluster_is_usable(ci, order)) {
> +                if (cluster_is_empty(ci))
> +                        offset = cluster_offset(si, ci);
> +                found = alloc_swap_scan_cluster(si, ci, offset,
> +                                                order, usage);
> +        } else {
> +                unlock_cluster(ci);
>          }
> +        if (found)
> +                goto done;
>
>  new_cluster:
>          ci = isolate_lock_cluster(si, &si->free_clusters);
> @@ -991,6 +997,7 @@ unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>  done:
>          if (!(si->flags & SWP_SOLIDSTATE))
>                  spin_unlock(&si->global_cluster_lock);
> +
>          return found;
>  }
>
> @@ -2674,6 +2681,8 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
>                  for (i = 0; i < SWAP_NR_ORDERS; i++)
>                          cmpxchg(&pcp_si[i], si, NULL);
>          }
> +
> +        flush_swap_cgroup_priority_percpu_swapdev(si);
>  }
>
>
> @@ -2802,6 +2811,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>          arch_swap_invalidate_area(p->type);
>          zswap_swapoff(p->type);
>          mutex_unlock(&swapon_mutex);
> +        free_percpu_swap_cluster(p);
>          kfree(p->global_cluster);
>          p->global_cluster = NULL;
>          vfree(swap_map);
> @@ -2900,7 +2910,6 @@ static void swap_stop(struct seq_file *swap, void *v)
>          mutex_unlock(&swapon_mutex);
>  }
>
> -
>  #ifndef CONFIG_SWAP_CGROUP_PRIORITY
>  static int swap_show(struct seq_file *swap, void *v)
>  {
> @@ -3239,7 +3248,10 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>          for (i = 0; i < nr_clusters; i++)
>                  spin_lock_init(&cluster_info[i].lock);
>
> -        if (!(si->flags & SWP_SOLIDSTATE)) {
> +        if (si->flags & SWP_SOLIDSTATE) {
> +                if (!alloc_percpu_swap_cluster(si))
> +                        goto err_free;
> +        } else {
>                  si->global_cluster = kmalloc(sizeof(*si->global_cluster),
>                                               GFP_KERNEL);
>                  if (!si->global_cluster)
> @@ -3532,6 +3544,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  bad_swap_unlock_inode:
>          inode_unlock(inode);
>  bad_swap:
> +        free_percpu_swap_cluster(si);
>          kfree(si->global_cluster);
>          si->global_cluster = NULL;
>          inode = NULL;
> --
> 2.34.1
>
>
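To illustrate the percpu_swap_cluster point above: in alloc_swap_scan_cluster() I would expect the SSD branch to end up roughly like the following (untested, just to show the direction), with percpu_swap_cluster.offset[] gone and every cgroup, root included, sharing the same per-device hint:

        if (si->flags & SWP_SOLIDSTATE) {
                /* One per-device, per-CPU next-offset hint shared by all cgroups. */
                this_cpu_write(si->percpu_cluster->next[order], next);
                this_cpu_write(percpu_swap_cluster.si[order], si);
        } else {
                si->global_cluster->next[order] = next;
        }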