[RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
@ 2025-05-19 13:11 Chen Ridong
  2025-05-19 13:11 ` [RFC next v2 1/2] ucounts: free ucount only count and rlimit are zero Chen Ridong
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Chen Ridong @ 2025-05-19 13:11 UTC (permalink / raw)
  To: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	bigeasy, paulmck, chenridong, roman.gushchin, brauner, pmladek,
	geert, mingo, rrangel, francesco, kpsingh, guoweikang.kernel,
	link, viro, neil, nichen, tglx, frederic, peterz, oleg,
	joel.granados, linux, avagin, legion
  Cc: linux-kernel, linux-mm, lujialin4

From: Chen Ridong <chenridong@huawei.com>

The will-it-scale test case signal1 [1] has been observed. and the test
results reveal that the signal sending system call lacks linearity.
To further investigate this issue, we initiated a series of tests by
launching varying numbers of dockers and closely monitored the throughput
of each individual docker. The detailed test outcomes are presented as
follows:

	| Dockers     |1      |4      |8      |16     |32     |64     |
	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |

The data clearly demonstrates a discernible trend: as the quantity of
dockers increases, the throughput per container progressively declines.
In-depth analysis has identified the root cause of this performance
degradation. The ucouts module conducts statistics on rlimit, which
involves a significant number of atomic operations. These atomic
operations, when acting on the same variable, trigger a substantial number
of cache misses or remote accesses, ultimately resulting in a drop in
performance.

This patch set addresses scalability issues in the ucounts rlimit by
replacing atomic rlimit counters with percpu_counter, which distributes
counts across CPU cores to reduce cache contention under heavy load.

Patch 1 modifies thate ucount can be freed until both the refcount and
rlimit are fully released, minimizing redundant summations. Patch 2 turns
the atomic rlimit to percpu_counter, which is suggested by Andrew.

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

---
v2: use percpu_counter intead of cache rlimit.

v1: https://lore.kernel.org/lkml/20250509072054.148257-1-chenridong@huaweicloud.com/

Chen Ridong (2):
  ucounts: free ucount only count and rlimit are zero
  ucounts: turn the atomic rlimit to percpu_counter

 include/linux/user_namespace.h |  17 +++-
 init/main.c                    |   1 +
 ipc/mqueue.c                   |   6 +-
 kernel/signal.c                |   8 +-
 kernel/ucount.c                | 169 +++++++++++++++++++++++----------
 mm/mlock.c                     |   5 +-
 6 files changed, 138 insertions(+), 68 deletions(-)

-- 
2.34.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC next v2 1/2] ucounts: free ucount only count and rlimit are zero
  2025-05-19 13:11 [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
@ 2025-05-19 13:11 ` Chen Ridong
  2025-05-19 13:11 ` [RFC next v2 2/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
  2025-05-19 19:32 ` [RFC next v2 0/2] " Jann Horn
  2 siblings, 0 replies; 8+ messages in thread
From: Chen Ridong @ 2025-05-19 13:11 UTC (permalink / raw)
  To: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	bigeasy, paulmck, chenridong, roman.gushchin, brauner, pmladek,
	geert, mingo, rrangel, francesco, kpsingh, guoweikang.kernel,
	link, viro, neil, nichen, tglx, frederic, peterz, oleg,
	joel.granados, linux, avagin, legion
  Cc: linux-kernel, linux-mm, lujialin4

From: Chen Ridong <chenridong@huawei.com>

After the commit fda31c50292a ("signal: avoid double atomic counter
increments for user accounting") and the commit 15bc01effefe ("ucounts:
Fix signal ucount refcounting"), the reference counting mechanism for
ucounts has the following behavior. The reference count is incremented
when the first pending signal pins to the ucounts, and it is decremented
when the last pending signal is dequeued. This implies that as long as
there are any pending signals pinned to the ucounts, the ucounts cannot
be freed.

To address the scalability issue, the next patch will mention, the
ucounts.rlimits will be converted to percpu_counter. However, summing up
the percpu counters is expensive. To overcome this, this patch modifies
the conditions for freeing ucounts. Instead of complex checks regarding
whether a pending signal is the first or the last one, the ucounts can now
be freed only when both the refcount and the rlimits are zero.
This change not only simplifies the logic but also reduces the number of
atomic operations.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h |  1 +
 kernel/ucount.c                | 75 ++++++++++++++++++++++++++--------
 2 files changed, 59 insertions(+), 17 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index a0bb6d012137..6e2229ea4673 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -122,6 +122,7 @@ struct ucounts {
 	kuid_t uid;
 	struct rcu_head rcu;
 	rcuref_t count;
+	atomic_long_t freed;
 	atomic_long_t ucount[UCOUNT_COUNTS];
 	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
 };
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8686e329b8f2..125471af7d59 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -185,18 +185,61 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	return new;
 }
 
-void put_ucounts(struct ucounts *ucounts)
+/*
+ * Whether all the rlimits are zero.
+ * For now, only UCOUNT_RLIMIT_SIGPENDING is considered.
+ * Other rlimit can be added.
+ */
+static bool rlimits_are_zero(struct ucounts *ucounts)
+{
+	int rtypes[] = { UCOUNT_RLIMIT_SIGPENDING };
+	int rtype;
+
+	for (int i = 0; i < sizeof(rtypes)/sizeof(int); ++i) {
+		rtype = rtypes[i];
+		if (atomic_long_read(&ucounts->rlimit[rtype]) > 0)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * Ucounts can be freed only when the ucount->count is released
+ * and the rlimits are zero.
+ * The caller should hold rcu_read_lock();
+ */
+static bool ucounts_can_be_freed(struct ucounts *ucounts)
+{
+	if (rcuref_read(&ucounts->count) > 0)
+		return false;
+	if (!rlimits_are_zero(ucounts))
+		return false;
+	/* Prevent double free */
+	return atomic_long_cmpxchg(&ucounts->freed, 0, 1) == 0;
+}
+
+static void free_ucounts(struct ucounts *ucounts)
 {
 	unsigned long flags;
 
-	if (rcuref_put(&ucounts->count)) {
-		spin_lock_irqsave(&ucounts_lock, flags);
-		hlist_nulls_del_rcu(&ucounts->node);
-		spin_unlock_irqrestore(&ucounts_lock, flags);
+	spin_lock_irqsave(&ucounts_lock, flags);
+	hlist_nulls_del_rcu(&ucounts->node);
+	spin_unlock_irqrestore(&ucounts_lock, flags);
+
+	put_user_ns(ucounts->ns);
+	kfree_rcu(ucounts, rcu);
+}
 
-		put_user_ns(ucounts->ns);
-		kfree_rcu(ucounts, rcu);
+void put_ucounts(struct ucounts *ucounts)
+{
+	rcu_read_lock();
+	if (rcuref_put(&ucounts->count) &&
+	    ucounts_can_be_freed(ucounts)) {
+		rcu_read_unlock();
+		free_ucounts(ucounts);
+		return;
 	}
+	rcu_read_unlock();
 }
 
 static inline bool atomic_long_inc_below(atomic_long_t *v, int u)
@@ -281,11 +324,17 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
 	struct ucounts *iter, *next;
 	for (iter = ucounts; iter != last; iter = next) {
+		bool to_free;
+
+		rcu_read_lock();
 		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
 		WARN_ON_ONCE(dec < 0);
 		next = iter->ns->ucounts;
-		if (dec == 0)
-			put_ucounts(iter);
+		to_free = ucounts_can_be_freed(iter);
+		rcu_read_unlock();
+		/* If ucounts->count is zero and the rlimits are zero, free ucounts */
+		if (to_free)
+			free_ucounts(iter);
 	}
 }
 
@@ -310,14 +359,6 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
 			ret = new;
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
-		/*
-		 * Grab an extra ucount reference for the caller when
-		 * the rlimit count was previously 0.
-		 */
-		if (new != 1)
-			continue;
-		if (!get_ucounts(iter))
-			goto dec_unwind;
 	}
 	return ret;
 dec_unwind:
-- 
2.34.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RFC next v2 2/2] ucounts: turn the atomic rlimit to percpu_counter
  2025-05-19 13:11 [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
  2025-05-19 13:11 ` [RFC next v2 1/2] ucounts: free ucount only count and rlimit are zero Chen Ridong
@ 2025-05-19 13:11 ` Chen Ridong
  2025-05-19 19:32 ` [RFC next v2 0/2] " Jann Horn
  2 siblings, 0 replies; 8+ messages in thread
From: Chen Ridong @ 2025-05-19 13:11 UTC (permalink / raw)
  To: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, jannh, pfalcato,
	bigeasy, paulmck, chenridong, roman.gushchin, brauner, pmladek,
	geert, mingo, rrangel, francesco, kpsingh, guoweikang.kernel,
	link, viro, neil, nichen, tglx, frederic, peterz, oleg,
	joel.granados, linux, avagin, legion
  Cc: linux-kernel, linux-mm, lujialin4

From: Chen Ridong <chenridong@huawei.com>

The will-it-scale test case signal1 [1] has been observed. and the test
results reveal that the signal sending system call lacks linearity.
To further investigate this issue, we initiated a series of tests by
launching varying numbers of dockers and closely monitored the throughput
of each individual docker. The detailed test outcomes are presented as
follows:

  | Dockers 	|1	|4	|8	|16	|32	|64	|
  | Throughput 	|380068	|353204	|308948	|306453	|180659	|129152	|

The data clearly demonstrates a discernible trend: as the quantity of
dockers increases, the throughput per container progressively declines.
In-depth analysis has identified the root cause of this performance
degradation. The ucounts module conducts statistics on rlimit, which
involves a significant number of atomic operations. These atomic
operations, when acting on the same variable, trigger a substantial number
of cache misses or remote accesses, ultimately resulting in a drop in
performance.

To address the above issues, this patch converts the atomic rlimit to a
percpu_counter. After the optimization, the performance data is shown
below, demonstrating that the throughput no longer declines as the number
of Docker containers increases:

  | Dockers 	|1	|4	|8	|16	|32	|64	|
  | Throughput 	|374737	|376377	|374814	|379284	|374950	|377509	|

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h | 16 ++++--
 init/main.c                    |  1 +
 ipc/mqueue.c                   |  6 +--
 kernel/signal.c                |  8 +--
 kernel/ucount.c                | 98 ++++++++++++++++++++++------------
 mm/mlock.c                     |  5 +-
 6 files changed, 81 insertions(+), 53 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 6e2229ea4673..0d1251e1f9ea 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -12,6 +12,7 @@
 #include <linux/rwsem.h>
 #include <linux/sysctl.h>
 #include <linux/err.h>
+#include <linux/percpu_counter.h>
 
 #define UID_GID_MAP_MAX_BASE_EXTENTS 5
 #define UID_GID_MAP_MAX_EXTENTS 340
@@ -124,7 +125,7 @@ struct ucounts {
 	rcuref_t count;
 	atomic_long_t freed;
 	atomic_long_t ucount[UCOUNT_COUNTS];
-	atomic_long_t rlimit[UCOUNT_RLIMIT_COUNTS];
+	struct percpu_counter rlimit[UCOUNT_RLIMIT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -136,6 +137,7 @@ struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_ty
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
 void put_ucounts(struct ucounts *ucounts);
+void __init ucounts_init(void);
 
 static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 {
@@ -146,13 +148,17 @@ static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 
 static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type type)
 {
-	return atomic_long_read(&ucounts->rlimit[type]);
+	return percpu_counter_sum(&ucounts->rlimit[type]);
 }
 
-long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
-bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
+bool inc_rlimit_ucounts_limit(struct ucounts *ucounts, enum rlimit_type type, long v, long limit);
+static inline bool inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+{
+	return inc_rlimit_ucounts_limit(ucounts, type, v, LONG_MAX);
+}
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit);
+			    bool override_rlimit, long limit);
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
 
diff --git a/init/main.c b/init/main.c
index 7f0a2a3dbd29..1168c0c453ff 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1071,6 +1071,7 @@ void start_kernel(void)
 		efi_enter_virtual_mode();
 #endif
 	thread_stack_cache_init();
+	ucounts_init();
 	cred_init();
 	fork_init();
 	proc_caches_init();
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 35b4f8659904..e4bd211900ab 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -371,11 +371,9 @@ static struct inode *mqueue_get_inode(struct super_block *sb,
 		mq_bytes += mq_treesize;
 		info->ucounts = get_ucounts(current_ucounts());
 		if (info->ucounts) {
-			long msgqueue;
-
 			spin_lock(&mq_lock);
-			msgqueue = inc_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
-			if (msgqueue == LONG_MAX || msgqueue > rlimit(RLIMIT_MSGQUEUE)) {
+			if (!inc_rlimit_ucounts_limit(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE,
+							mq_bytes, rlimit(RLIMIT_MSGQUEUE))) {
 				dec_rlimit_ucounts(info->ucounts, UCOUNT_RLIMIT_MSGQUEUE, mq_bytes);
 				spin_unlock(&mq_lock);
 				put_ucounts(info->ucounts);
diff --git a/kernel/signal.c b/kernel/signal.c
index f8859faa26c5..2b6ed2168db6 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -416,13 +416,9 @@ static struct ucounts *sig_get_ucounts(struct task_struct *t, int sig,
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
 	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
-					    override_rlimit);
+					    override_rlimit, task_rlimit(t, RLIMIT_SIGPENDING));
 	rcu_read_unlock();
-	if (!sigpending)
-		return NULL;
-
-	if (unlikely(!override_rlimit && sigpending > task_rlimit(t, RLIMIT_SIGPENDING))) {
-		dec_rlimit_put_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING);
+	if (!sigpending) {
 		print_dropped_signal(sig);
 		return NULL;
 	}
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 125471af7d59..a856f3d4a9a1 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -158,6 +158,7 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 {
 	struct hlist_nulls_head *hashent = ucounts_hashentry(ns, uid);
 	struct ucounts *ucounts, *new;
+	int i = 0, j = 0;
 
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (ucounts)
@@ -170,11 +171,16 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	new->ns = ns;
 	new->uid = uid;
 	rcuref_init(&new->count, 1);
-
+	for (i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i) {
+		if (percpu_counter_init(&new->rlimit[i], 0, GFP_KERNEL))
+			goto failed;
+	}
 	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (ucounts) {
 		spin_unlock_irq(&ucounts_lock);
+		for (j = 0; j < UCOUNT_RLIMIT_COUNTS; ++j)
+			percpu_counter_destroy(&new->rlimit[j]);
 		kfree(new);
 		return ucounts;
 	}
@@ -183,6 +189,12 @@ struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid)
 	get_user_ns(new->ns);
 	spin_unlock_irq(&ucounts_lock);
 	return new;
+
+failed:
+	for (j = 0; i > 0 && j < i - 1; ++j)
+		percpu_counter_destroy(&new->rlimit[j]);
+	kfree(new);
+	return NULL;
 }
 
 /*
@@ -197,7 +209,7 @@ static bool rlimits_are_zero(struct ucounts *ucounts)
 
 	for (int i = 0; i < sizeof(rtypes)/sizeof(int); ++i) {
 		rtype = rtypes[i];
-		if (atomic_long_read(&ucounts->rlimit[rtype]) > 0)
+		if (get_rlimit_value(ucounts, rtype) > 0)
 			return false;
 	}
 	return true;
@@ -225,7 +237,8 @@ static void free_ucounts(struct ucounts *ucounts)
 	spin_lock_irqsave(&ucounts_lock, flags);
 	hlist_nulls_del_rcu(&ucounts->node);
 	spin_unlock_irqrestore(&ucounts_lock, flags);
-
+	for (int i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i)
+		percpu_counter_destroy(&ucounts->rlimit[i]);
 	put_user_ns(ucounts->ns);
 	kfree_rcu(ucounts, rcu);
 }
@@ -289,36 +302,35 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type)
 	put_ucounts(ucounts);
 }
 
-long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+bool inc_rlimit_ucounts_limit(struct ucounts *ucounts, enum rlimit_type type,
+					long v, long limit)
 {
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long ret = 0;
+	bool good = true;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(v, &iter->rlimit[type]);
-		if (new < 0 || new > max)
-			ret = LONG_MAX;
-		else if (iter == ucounts)
-			ret = new;
+		max = min(limit, max);
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, v))
+			good = false;
+
 		max = get_userns_rlimit_max(iter->ns, type);
 	}
-	return ret;
+	return good;
 }
 
-bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+void dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
 {
 	struct ucounts *iter;
-	long new = -1; /* Silence compiler warning */
-	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long dec = atomic_long_sub_return(v, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
-		if (iter == ucounts)
-			new = dec;
-	}
-	return (new == 0);
+
+	for (iter = ucounts; iter; iter = iter->ns->ucounts)
+		percpu_counter_sub(&iter->rlimit[type], v);
 }
 
+/*
+ * The inc_rlimit_get_ucounts does not grab the refcount.
+ * The rlimit_release should be called very time the rlimit is decremented.
+ */
 static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 				struct ucounts *last, enum rlimit_type type)
 {
@@ -327,8 +339,7 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 		bool to_free;
 
 		rcu_read_lock();
-		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
+		percpu_counter_sub(&iter->rlimit[type], 1);
 		next = iter->ns->ucounts;
 		to_free = ucounts_can_be_freed(iter);
 		rcu_read_unlock();
@@ -343,29 +354,37 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
 }
 
+/*
+ * Though this function does not grab the refcount, it is promised that the
+ * ucounts will not be freed as long as there have any rlimit pins to it.
+ * Caller must hold a reference to ucounts or under rcu_read_lock().
+ *
+ * Return 1 if increments successful, otherwise return 0.
+ */
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit)
+			    bool override_rlimit, long limit)
 {
-	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long dec, ret = 0;
+	long ret = 0;
+
+	if (override_rlimit)
+		limit = LONG_MAX;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(1, &iter->rlimit[type]);
-		if (new < 0 || new > max)
+		/* Can not exceed the limit(inputed) or the ns->rlimit_max */
+		max = min(limit, max);
+
+		if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1))
 			goto dec_unwind;
-		if (iter == ucounts)
-			ret = new;
+
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
 	}
-	return ret;
+	return 1;
 dec_unwind:
-	dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-	WARN_ON_ONCE(dec < 0);
 	do_dec_rlimit_put_ucounts(ucounts, iter, type);
-	return 0;
+	return ret;
 }
 
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long rlimit)
@@ -374,15 +393,23 @@ bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigne
 	long max = rlimit;
 	if (rlimit > LONG_MAX)
 		max = LONG_MAX;
+
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long val = get_rlimit_value(iter, type);
-		if (val < 0 || val > max)
+		/* iter->rlimit[type] > max return 1 */
+		if (percpu_counter_compare(&iter->rlimit[type], max) > 0)
 			return true;
+
 		max = get_userns_rlimit_max(iter->ns, type);
 	}
 	return false;
 }
 
+void __init ucounts_init(void)
+{
+	for (int i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i)
+		percpu_counter_init(&init_ucounts.rlimit[i], 0, GFP_KERNEL);
+}
+
 static __init int user_namespace_sysctl_init(void)
 {
 #ifdef CONFIG_SYSCTL
@@ -398,6 +425,7 @@ static __init int user_namespace_sysctl_init(void)
 	BUG_ON(!user_header);
 	BUG_ON(!setup_userns_sysctls(&init_user_ns));
 #endif
+
 	hlist_add_ucounts(&init_ucounts);
 	inc_rlimit_ucounts(&init_ucounts, UCOUNT_RLIMIT_NPROC, 1);
 	return 0;
diff --git a/mm/mlock.c b/mm/mlock.c
index 3cb72b579ffd..20f3b62b3ec0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -793,7 +793,6 @@ static DEFINE_SPINLOCK(shmlock_user_lock);
 int user_shm_lock(size_t size, struct ucounts *ucounts)
 {
 	unsigned long lock_limit, locked;
-	long memlock;
 	int allowed = 0;
 
 	locked = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
@@ -801,9 +800,9 @@ int user_shm_lock(size_t size, struct ucounts *ucounts)
 	if (lock_limit != RLIM_INFINITY)
 		lock_limit >>= PAGE_SHIFT;
 	spin_lock(&shmlock_user_lock);
-	memlock = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 
-	if ((memlock == LONG_MAX || memlock > lock_limit) && !capable(CAP_IPC_LOCK)) {
+	if (!inc_rlimit_ucounts_limit(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked, lock_limit)
+		&& !capable(CAP_IPC_LOCK)) {
 		dec_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_MEMLOCK, locked);
 		goto out;
 	}
-- 
2.34.1



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
  2025-05-19 13:11 [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
  2025-05-19 13:11 ` [RFC next v2 1/2] ucounts: free ucount only count and rlimit are zero Chen Ridong
  2025-05-19 13:11 ` [RFC next v2 2/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
@ 2025-05-19 19:32 ` Jann Horn
  2025-05-19 21:01   ` Alexey Gladkov
  2025-05-21  1:41   ` Chen Ridong
  2 siblings, 2 replies; 8+ messages in thread
From: Jann Horn @ 2025-05-19 19:32 UTC (permalink / raw)
  To: Chen Ridong
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, pfalcato, bigeasy,
	paulmck, chenridong, roman.gushchin, brauner, pmladek, geert,
	mingo, rrangel, francesco, kpsingh, guoweikang.kernel, link,
	viro, neil, nichen, tglx, frederic, peterz, oleg, joel.granados,
	linux, avagin, legion, linux-kernel, linux-mm, lujialin4,
	Serge E. Hallyn, David Howells

On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> From: Chen Ridong <chenridong@huawei.com>
>
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
> To further investigate this issue, we initiated a series of tests by
> launching varying numbers of dockers and closely monitored the throughput
> of each individual docker. The detailed test outcomes are presented as
> follows:
>
>         | Dockers     |1      |4      |8      |16     |32     |64     |
>         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>
> The data clearly demonstrates a discernible trend: as the quantity of
> dockers increases, the throughput per container progressively declines.

But is that actually a problem? Do you have real workloads that
concurrently send so many signals, or create inotify watches so
quickly, that this is has an actual performance impact?

> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.

You're probably running into the namespace-associated ucounts here? So
the issue is probably that Docker creates all your containers with the
same owner UID (EUID at namespace creation), causing them all to
account towards a single ucount, while normally outside of containers,
each RUID has its own ucount instance?

Sharing of rlimits between containers is probably normally undesirable
even without the cacheline bouncing, because it means that too much
resource usage in one container can cause resource allocations in
another container to fail... so I think the real problem here is at a
higher level, in the namespace setup code. Maybe root should be able
to create a namespace that doesn't inherit ucount limits of its owner
UID, or something like that...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
  2025-05-19 19:32 ` [RFC next v2 0/2] " Jann Horn
@ 2025-05-19 21:01   ` Alexey Gladkov
  2025-05-19 21:24     ` Jann Horn
  2025-05-21  1:41   ` Chen Ridong
  1 sibling, 1 reply; 8+ messages in thread
From: Alexey Gladkov @ 2025-05-19 21:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Chen Ridong, akpm, Liam.Howlett, lorenzo.stoakes, vbabka,
	pfalcato, bigeasy, paulmck, chenridong, roman.gushchin, brauner,
	pmladek, geert, mingo, rrangel, francesco, kpsingh,
	guoweikang.kernel, link, viro, neil, nichen, tglx, frederic,
	peterz, oleg, joel.granados, linux, avagin, linux-kernel,
	linux-mm, lujialin4, Serge E. Hallyn, David Howells

On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> > From: Chen Ridong <chenridong@huawei.com>
> >
> > The will-it-scale test case signal1 [1] has been observed. and the test
> > results reveal that the signal sending system call lacks linearity.
> > To further investigate this issue, we initiated a series of tests by
> > launching varying numbers of dockers and closely monitored the throughput
> > of each individual docker. The detailed test outcomes are presented as
> > follows:
> >
> >         | Dockers     |1      |4      |8      |16     |32     |64     |
> >         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> >
> > The data clearly demonstrates a discernible trend: as the quantity of
> > dockers increases, the throughput per container progressively declines.
> 
> But is that actually a problem? Do you have real workloads that
> concurrently send so many signals, or create inotify watches so
> quickly, that this is has an actual performance impact?
> 
> > In-depth analysis has identified the root cause of this performance
> > degradation. The ucouts module conducts statistics on rlimit, which
> > involves a significant number of atomic operations. These atomic
> > operations, when acting on the same variable, trigger a substantial number
> > of cache misses or remote accesses, ultimately resulting in a drop in
> > performance.
> 
> You're probably running into the namespace-associated ucounts here? So
> the issue is probably that Docker creates all your containers with the
> same owner UID (EUID at namespace creation), causing them all to
> account towards a single ucount, while normally outside of containers,
> each RUID has its own ucount instance?
> 
> Sharing of rlimits between containers is probably normally undesirable
> even without the cacheline bouncing, because it means that too much
> resource usage in one container can cause resource allocations in
> another container to fail... so I think the real problem here is at a
> higher level, in the namespace setup code. Maybe root should be able
> to create a namespace that doesn't inherit ucount limits of its owner
> UID, or something like that...

If we allow rlimits not to be inherited in the userns being created, the
user will be able to bypass their rlimits by running a fork bomb inside
the new userns.

Or I missed your point ?

In init_user_ns all rlimits that are bound to it are set to RLIM_INFINITY.
So root can only reduce rlimits.

https://web.git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c#n1091

-- 
Rgrds, legion



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
  2025-05-19 21:01   ` Alexey Gladkov
@ 2025-05-19 21:24     ` Jann Horn
  2025-05-21  1:48       ` Chen Ridong
  0 siblings, 1 reply; 8+ messages in thread
From: Jann Horn @ 2025-05-19 21:24 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Chen Ridong, akpm, Liam.Howlett, lorenzo.stoakes, vbabka,
	pfalcato, bigeasy, paulmck, chenridong, roman.gushchin, brauner,
	pmladek, geert, mingo, rrangel, francesco, kpsingh,
	guoweikang.kernel, link, viro, neil, nichen, tglx, frederic,
	peterz, oleg, joel.granados, linux, avagin, linux-kernel,
	linux-mm, lujialin4, Serge E. Hallyn, David Howells

On Mon, May 19, 2025 at 11:01 PM Alexey Gladkov <legion@kernel.org> wrote:
> On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
> > On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> > > From: Chen Ridong <chenridong@huawei.com>
> > >
> > > The will-it-scale test case signal1 [1] has been observed. and the test
> > > results reveal that the signal sending system call lacks linearity.
> > > To further investigate this issue, we initiated a series of tests by
> > > launching varying numbers of dockers and closely monitored the throughput
> > > of each individual docker. The detailed test outcomes are presented as
> > > follows:
> > >
> > >         | Dockers     |1      |4      |8      |16     |32     |64     |
> > >         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> > >
> > > The data clearly demonstrates a discernible trend: as the quantity of
> > > dockers increases, the throughput per container progressively declines.
> >
> > But is that actually a problem? Do you have real workloads that
> > concurrently send so many signals, or create inotify watches so
> > quickly, that this is has an actual performance impact?
> >
> > > In-depth analysis has identified the root cause of this performance
> > > degradation. The ucouts module conducts statistics on rlimit, which
> > > involves a significant number of atomic operations. These atomic
> > > operations, when acting on the same variable, trigger a substantial number
> > > of cache misses or remote accesses, ultimately resulting in a drop in
> > > performance.
> >
> > You're probably running into the namespace-associated ucounts here? So
> > the issue is probably that Docker creates all your containers with the
> > same owner UID (EUID at namespace creation), causing them all to
> > account towards a single ucount, while normally outside of containers,
> > each RUID has its own ucount instance?
> >
> > Sharing of rlimits between containers is probably normally undesirable
> > even without the cacheline bouncing, because it means that too much
> > resource usage in one container can cause resource allocations in
> > another container to fail... so I think the real problem here is at a
> > higher level, in the namespace setup code. Maybe root should be able
> > to create a namespace that doesn't inherit ucount limits of its owner
> > UID, or something like that...
>
> If we allow rlimits not to be inherited in the userns being created, the
> user will be able to bypass their rlimits by running a fork bomb inside
> the new userns.
>
> Or I missed your point ?

You're right, I guess it would actually still be necessary to have one
shared limit across the entire container, so rather than not having a
namespace-level ucount, maybe it would make more sense to have a
private ucount instance for a container...

(But to be clear I'm not invested in this suggestion at all, I just
looked at that patch and was wondering about alternatives if that is
actually a real performance problem...)


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
  2025-05-19 21:24     ` Jann Horn
@ 2025-05-21  1:48       ` Chen Ridong
  0 siblings, 0 replies; 8+ messages in thread
From: Chen Ridong @ 2025-05-21  1:48 UTC (permalink / raw)
  To: Jann Horn, Alexey Gladkov
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, pfalcato, bigeasy,
	paulmck, chenridong, roman.gushchin, brauner, pmladek, geert,
	mingo, rrangel, francesco, kpsingh, guoweikang.kernel, link,
	viro, neil, nichen, tglx, frederic, peterz, oleg, joel.granados,
	linux, avagin, linux-kernel, linux-mm, lujialin4,
	Serge E. Hallyn, David Howells



On 2025/5/20 5:24, Jann Horn wrote:
> On Mon, May 19, 2025 at 11:01 PM Alexey Gladkov <legion@kernel.org> wrote:
>> On Mon, May 19, 2025 at 09:32:17PM +0200, Jann Horn wrote:
>>> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>>> From: Chen Ridong <chenridong@huawei.com>
>>>>
>>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>>> results reveal that the signal sending system call lacks linearity.
>>>> To further investigate this issue, we initiated a series of tests by
>>>> launching varying numbers of dockers and closely monitored the throughput
>>>> of each individual docker. The detailed test outcomes are presented as
>>>> follows:
>>>>
>>>>         | Dockers     |1      |4      |8      |16     |32     |64     |
>>>>         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>>>
>>>> The data clearly demonstrates a discernible trend: as the quantity of
>>>> dockers increases, the throughput per container progressively declines.
>>>
>>> But is that actually a problem? Do you have real workloads that
>>> concurrently send so many signals, or create inotify watches so
>>> quickly, that this is has an actual performance impact?
>>>
>>>> In-depth analysis has identified the root cause of this performance
>>>> degradation. The ucouts module conducts statistics on rlimit, which
>>>> involves a significant number of atomic operations. These atomic
>>>> operations, when acting on the same variable, trigger a substantial number
>>>> of cache misses or remote accesses, ultimately resulting in a drop in
>>>> performance.
>>>
>>> You're probably running into the namespace-associated ucounts here? So
>>> the issue is probably that Docker creates all your containers with the
>>> same owner UID (EUID at namespace creation), causing them all to
>>> account towards a single ucount, while normally outside of containers,
>>> each RUID has its own ucount instance?
>>>
>>> Sharing of rlimits between containers is probably normally undesirable
>>> even without the cacheline bouncing, because it means that too much
>>> resource usage in one container can cause resource allocations in
>>> another container to fail... so I think the real problem here is at a
>>> higher level, in the namespace setup code. Maybe root should be able
>>> to create a namespace that doesn't inherit ucount limits of its owner
>>> UID, or something like that...
>>
>> If we allow rlimits not to be inherited in the userns being created, the
>> user will be able to bypass their rlimits by running a fork bomb inside
>> the new userns.
>>
>> Or I missed your point ?
> 
> You're right, I guess it would actually still be necessary to have one
> shared limit across the entire container, so rather than not having a
> namespace-level ucount, maybe it would make more sense to have a
> private ucount instance for a container...
> 

It sounds like the private ucounts were what I was trying to implement
in version 1? It applies batch counts from the parent for each user
namespace, but the approach is complex.

Best regards,
Ridong

> (But to be clear I'm not invested in this suggestion at all, I just
> looked at that patch and was wondering about alternatives if that is
> actually a real performance problem...)



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter
  2025-05-19 19:32 ` [RFC next v2 0/2] " Jann Horn
  2025-05-19 21:01   ` Alexey Gladkov
@ 2025-05-21  1:41   ` Chen Ridong
  1 sibling, 0 replies; 8+ messages in thread
From: Chen Ridong @ 2025-05-21  1:41 UTC (permalink / raw)
  To: Jann Horn
  Cc: akpm, Liam.Howlett, lorenzo.stoakes, vbabka, pfalcato, bigeasy,
	paulmck, chenridong, roman.gushchin, brauner, pmladek, geert,
	mingo, rrangel, francesco, kpsingh, guoweikang.kernel, link,
	viro, neil, nichen, tglx, frederic, peterz, oleg, joel.granados,
	linux, avagin, legion, linux-kernel, linux-mm, lujialin4,
	Serge E. Hallyn, David Howells



On 2025/5/20 3:32, Jann Horn wrote:
> On Mon, May 19, 2025 at 3:25 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>> From: Chen Ridong <chenridong@huawei.com>
>>
>> The will-it-scale test case signal1 [1] has been observed. and the test
>> results reveal that the signal sending system call lacks linearity.
>> To further investigate this issue, we initiated a series of tests by
>> launching varying numbers of dockers and closely monitored the throughput
>> of each individual docker. The detailed test outcomes are presented as
>> follows:
>>
>>         | Dockers     |1      |4      |8      |16     |32     |64     |
>>         | Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>
>> The data clearly demonstrates a discernible trend: as the quantity of
>> dockers increases, the throughput per container progressively declines.
> 
> But is that actually a problem? Do you have real workloads that
> concurrently send so many signals, or create inotify watches so
> quickly, that this is has an actual performance impact?
> 

Thanks Jann,
Unfortunately, I do not have the specific scenario.

>> In-depth analysis has identified the root cause of this performance
>> degradation. The ucouts module conducts statistics on rlimit, which
>> involves a significant number of atomic operations. These atomic
>> operations, when acting on the same variable, trigger a substantial number
>> of cache misses or remote accesses, ultimately resulting in a drop in
>> performance.
> 
> You're probably running into the namespace-associated ucounts here? So
> the issue is probably that Docker creates all your containers with the
> same owner UID (EUID at namespace creation), causing them all to
> account towards a single ucount, while normally outside of containers,
> each RUID has its own ucount instance?
> 

Yes, every time rlimits change in the containers, they have to change
the parent's rlimits, which are atomic options, even if these containers
have their own user_namespace.

Best regards,
Ridong

> Sharing of rlimits between containers is probably normally undesirable
> even without the cacheline bouncing, because it means that too much
> resource usage in one container can cause resource allocations in
> another container to fail... so I think the real problem here is at a
> higher level, in the namespace setup code. Maybe root should be able
> to create a namespace that doesn't inherit ucount limits of its owner
> UID, or something like that...



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-05-21  1:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-19 13:11 [RFC next v2 0/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
2025-05-19 13:11 ` [RFC next v2 1/2] ucounts: free ucount only count and rlimit are zero Chen Ridong
2025-05-19 13:11 ` [RFC next v2 2/2] ucounts: turn the atomic rlimit to percpu_counter Chen Ridong
2025-05-19 19:32 ` [RFC next v2 0/2] " Jann Horn
2025-05-19 21:01   ` Alexey Gladkov
2025-05-19 21:24     ` Jann Horn
2025-05-21  1:48       ` Chen Ridong
2025-05-21  1:41   ` Chen Ridong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox