* [RFC 0/6] Managed Percpu Refcount
@ 2024-09-16 5:08 Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 1/6] percpu-refcount: Add managed mode for RCU released objects Neeraj Upadhyay
` (5 more replies)
0 siblings, 6 replies; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
Introduction
------------
This patch series adds a new "managed mode" to percpu-refcounts, for
managing references to objects that are released after an RCU grace
period has passed since their last reference drop.

The typical usage pattern looks like the following:
    // Called with elevated refcount
    get()
            p = get_ptr();
            kref_get(&p->count);
            return p;

    get()
            rcu_read_lock();
            p = get_ptr();
            if (p && !kref_get_unless_zero(&p->count))
                    p = NULL;
            rcu_read_unlock();
            return p;

    release()
            remove_ptr(p);
            call_rcu(&p->rcu, freep);

    release()
            remove_ptr(p);
            kfree_rcu(p, rcu);
Requirement and Use Case
------------------------
Percpu refcount requires an explicit percpu_ref_kill() operation at the
point in the object's lifetime where the initial reference is dropped.
For optimal performance, the object should have reached a teardown
point by then, after which references are no longer acquired or
released frequently before the final reference is dropped. After
percpu_ref_kill(), all refcount operations on the object are carried
out on the centralized atomic counter. If references are still being
acquired and released after the percpu_ref_kill() operation, the
performance and scalability of those usages degrade because of the
atomic counter's cache line ping-pong between CPUs.
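
For reference, here is a minimal sketch of the conventional (unmanaged)
percpu ref lifecycle; the struct foo object and its helpers are
hypothetical, only the percpu_ref_*() calls are existing API:

    struct foo {
            struct percpu_ref ref;
            struct rcu_head rcu;
    };

    static void foo_release(struct percpu_ref *ref)
    {
            struct foo *p = container_of(ref, struct foo, ref);

            kfree_rcu(p, rcu);
    }

    /* Object creation. */
    err = percpu_ref_init(&p->ref, foo_release, 0, GFP_KERNEL);

    /* Fast percpu get/put while the object is in active use. */
    percpu_ref_get(&p->ref);
    percpu_ref_put(&p->ref);

    /*
     * Teardown point: drops the initial reference. From here on, every
     * get/put falls back to the shared atomic counter until the last
     * reference is dropped and foo_release() is invoked.
     */
    percpu_ref_kill(&p->ref);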
The throughput scalability issue seen when Nginx runs with the AppArmor
Linux security module enabled is the primary motivation for this
change. Performance profiling shows that memory contention in the
atomic_fetch_add and atomic_fetch_sub operations carried out by
kref_get() and kref_put() on AppArmor labels accounts for the majority
of CPU cycles. Further information regarding the impact on Nginx
throughput scalability and the enhancements from percpu references can
be found in [1].
However, because of the way references are used in AppArmor, switching
from kref usage to percpu refcount was found to be non-trivial.
Although the specifics of AppArmor refcount management have already
been covered in [1], the explanation that follows aims to update that
information with more detailed (and hopefully more accurate)
information supporting the requirement for a managed percpu ref.
Within the AppArmor framework, a label (struct aa_label) manages
references for different kinds of objects. Labels are associated with:

- Profiles for applications.
- Namespaces, via their unconfined profile.
- Audit and secmark rules, and compound labels.

Labels are referenced by file contexts, security contexts, secids and
sockets.
The diagram below illustrates the relationship between different
AppArmor objects via their label references.
                  ----------------
                 | Root Namespace |
                  ----------------
                 /  ^        |  ^
              (a)   |       (c) |
               /   (b)       | (d)
              v     /        v  |
         ------------      -----------------
        | Profile 1  |    | Child Namespace |
         ------------      -----------------
          |   ^               |   ^
         (e)  |              (g)  |
          |  (f)              |  (h)
          v   |               v   |
      ---------------      -----------
     | Child Profile |    | Profile 2 |
      ---------------      -----------
             ^                  ^
              \                /
               \              /
                \            /
                     (i)
                      |
               ----------------
              | Compound Label |
               ----------------
(a) The root namespace keeps track of every profile that exists in it.
    When a profile is loaded and unpacked, a reference to the profile
    is taken for this purpose. This reference to the profile object is
    also used as its **init reference**.
(b) Root namespace is referenced by a profile that is part of it.
(c) To control confinement within a certain domain, such as a chroot
environment, a root namespace may include child namespaces. Through
each child namespace's unconfined label, the subnamespaces list in
    the root namespace maintains an (init) reference to child
namespaces.
(d) A child namespace maintains a reference to its parent namespace.
(e) A profile can have child subprofiles, which are called hat
    profiles. These allow certain program segments to run with
    permissions differing from the base permissions; for instance,
    executing user-supplied CGI programs under a different Apache
    profile, or handling authenticated and unauthenticated traffic
    under separate OpenSSH profiles. The parent profile maintains a
    reference to its child subprofiles through its policy profiles
    list. This serves as the child profile's init reference.
(f) Child profiles keep a reference to their parent profile.
(g) Child namespace keeps a reference to all profiles in it.
(h) A reference to the parent non-root namespace is maintained by child
profiles.
(i) Compound/stack labels are used to apply context-specific
    application confinement. When ls is started from bash, for
    instance, the confinement rules for the profile /bin/bash///bin/ls
    may differ from the system-level rules for ls execution. Compound
    labels are vectors of profiles and maintain a reference to every
    profile in their vector.
Label references
----------------
- Tasks are linked to labels via the security field of their cred. The
cred label is copied from the parent task during the bprm exec's cred
preparation, and the bprm is transitioned to the new label using the
parent task's profile transition rules. A compound/stack label or the
label of a single profile may be used in the transition depending on
the perms rule for the bprm's path.
When performing policy checks in AppArmor's security hooks for
operations like file permissions, mkdir, rmdir, mount, and so on, the
label linked to the task's cred is used. When the associated label is
marked as stale, the cred label of a task can change (from within its
context) while it is being executed.
A task maintains references to previous labels for hat transitions,
onexec labels, and nnp (no new privilege) labels for exec domain
transition checks.
Labels are cached in file context for file permissions checks on open
files. As a result of task profile updates, this label is updated
with new profiles from the task's current label during revalidations
of cached file permissions.
- Socket contexts store the labels of the current task and peer.
- Profile fs maintains references to the label proxy and namespace in
the inode->i_private fields.
- The label parsed from the rule string is referenced by Secmark rule
objects.
- The label parsed from the rule string is referenced by audit rule
objects.
Label's Initial Ref Teardown
----------------------------
- When a profile is deleted, the initial reference on its label is
  dropped and the profile is no longer a part of the parent namespace
  or parent profile. Furthermore, every one of its child profiles is
  deleted recursively. As a result, all profiles that are reachable
  from the base profile have their initial reference dropped in a
  cascaded manner.
- When a namespace is destroyed, the initial reference to its
unconfined label is dropped and it is removed from the parent
namespace view. Furthermore, all profiles in that namespace,
all sub namespaces, and all profiles inside those sub namespaces
are recursively removed and their initial label reference is dropped.
- A label's references to its parents are dropped when the label is
  released after its last reference drop. A profile drops its parent
  profile and namespace references on ref release. On the namespace
  ref release path, a namespace drops the reference to its parent
  namespace. As part of label release, the references to the profiles
  in a compound label's vector are also dropped.
Stale Labels and Label Redirection
----------------------------------
- The label associated with profile/namespace that is deleted is marked
as stale. When any profile of a compound label is stale, the compound
label is also marked stale.
- A label's proxy is used to redirect stale labels to the most recent
  or active version of the object. For example, when a profile is
  deleted, its proxy is redirected to the unconfined label of the
  namespace. This indicates that every application that the profile
  confined has been moved to the unconfined profile. In the same
  manner, the proxy is redirected to the new profile's label when a
  profile is replaced. On namespace deletion, the proxy of the
  namespace's unconfined label is redirected to the unconfined label
  of its parent namespace.
  Redirection to the new label is done during the reference get
  operation:

    struct aa_label *aa_get_newest_label(struct aa_label *l)
    {
            struct aa_label __rcu **lp = &l->proxy->label;
            struct aa_label *c;

            rcu_read_lock();
            do {
                    c = rcu_dereference(*lp);
            } while (c && !kref_get_unless_zero(&c->count));
            rcu_read_unlock();

            return c;
    }
Label reclaims
--------------
A label is completely initialized when it is linked to a namespace.
Label destruction is deferred until the end of an RCU grace period
which starts after the last reference drop. The RCU callback for label
and associated object destruction is enqueued from the ref release
callback:
    void aa_label_kref(struct kref *kref)
    {
            struct aa_label *label = container_of(kref, struct aa_label, count);
            struct aa_ns *ns = labels_ns(label);

            if (!ns) {
                    label_free_switch(label);
                    return;
            }

            call_rcu(&label->rcu, label_free_rcu);
    }
Using Label Stale operation for percpu_ref_kill()?
--------------------------------------------------
Marking a label as stale could serve as a reference termination point,
since stale labels are redirected to the current label linked to their
objects. There are other labels, though, that are not associated with
namespaces or profiles. These are compound labels linked to audit and
secmark rules, and labels whose references are held in the cred
structure of running tasks. These labels are:
- The label created from the rule string is referenced by audit rules.
  A multi-element vector audit rule label may already exist in the
  root labelset, or a new label may be created during audit rule init.
  The reference is dropped on audit rule free. The created label may
  be actively referenced from other contexts, which would cause atomic
  contention on the label's ref operations if percpu_ref_kill() were
  called on audit rule free.
- The stacked labels which are created on profile exec/domain
transitions are stored in task's cred structure. These labels are
released when all tasks drop their cred reference to those labels.
- Transition labels which are created during change hat or change
profile transitions could be referenced by multiple tasks. These
labels are released when all tasks drop their cred reference to
those labels.
- A task's most recent label is combined with and cached in open file
  contexts. These cached labels don't have a defined termination point
  and can be actively referenced from multiple contexts.
- Other compound labels with similar ref lifetimes include pivotroot
and secmark rules.
There are further scenarios in which stale labels may still be
referenced:

- Stale flags on labels are set using plain writes, and until a CPU
  observes the stale flag, references may still be acquired or
  released on the stale label.
- A task may take a reference to a namespace which is marked stale.
- For a stale cred label whose proxy points to its namespace's stale
  unconfined label, the stale unconfined label can be referenced until
  the cred label is updated.
In summary, although percpu_ref_kill() could be used for labels at the
point where they are marked stale, compound labels are not guaranteed
to be marked stale during their lifetime, and they do not have a
context where percpu_ref_kill() can be done.
Proposed Solution
-----------------
The solution proposed here attempts to address the issue of
identifying the init reference drop context. A percpu ref manager
thread keeps an extra reference to the ref. This additional reference
is used as a (pseudo) init reference to the object. A managed percpu
ref instance offloads its ref's release work to the ref manager thread.
The manager thread periodically scans the list of managed refs and uses
the following sequence to determine whether a ref is still active:
    scan_ref() {
            bool active;

            percpu_ref_switch_to_atomic_sync(&ref);
            rcu_read_lock();
            percpu_ref_put(&ref);
            active = percpu_ref_tryget(&ref);
            rcu_read_unlock();
            if (active)
                    percpu_ref_switch_to_percpu(&ref);
    }
The sequence above drops the pseudo-init reference, switches the
reference to atomic mode, and verifies (within RCU read-side
protection) that all references have been dropped. If there are still
active references, the reference is switched back to percpu mode, with
the pseudo-init reference reacquired through the tryget operation.
The two approaches used in this patch series, with slightly differing
permitted ref mode switches and semantics, are listed below.
Approach 1
----------
Approach 1 is implemented in patch 1 and has the following semantics
for ref init and mode switching.

a. Init

A ref can be put into managed mode at initialization time, by passing
the PERCPU_REF_REL_MANAGED flag to percpu_ref_init(), or by calling
percpu_ref_switch_to_managed() after init to switch a reinitable ref to
managed mode. The deferred switch is meant for situations like module
initialization errors, where the initial reference of an inited ref is
dropped before the object is ever used. One example is the release of
AppArmor labels which are not yet associated with a namespace, which is
done without waiting for an RCU grace period.
Below are the allowed initialization modes for a managed ref:

                  Atomic   Percpu   Dead   Reinit   Managed
    Managed-ref      Y        N       Y       Y        Y
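
As an illustration, here is a minimal sketch of the two entry points
into managed mode (the object and release callback are hypothetical, as
in the pattern above; the flag and function are the ones added by this
series):

    /* 1. Managed from initialization. */
    err = percpu_ref_init(&p->ref, foo_release,
                          PERCPU_REF_REL_MANAGED, GFP_KERNEL);

    /*
     * 2. Deferred switch of a reinitable ref, once the object is known
     *    to be RCU released (for AppArmor, once the label has been
     *    associated with a namespace).
     */
    err = percpu_ref_init(&p->ref, foo_release,
                          PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);
    ...
    err = percpu_ref_switch_to_managed(&p->ref);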
b. Switching modes and operations
Below are the allowed transitions for managed ref.
    To -->    A   P   P(RI)  M   D   D(RI)  D(RI/M)  KLL  REI  RES

    A         y   n     y    y   n     y       y       y    y    y
    P         n   n     n    n   y     n       n       y    n    n
    M         n   n     n    y   n     n       y       n    y    y
    P(RI)     y   n     y    y   n     y       y       y    y    y
    D(RI)     y   n     y    y   n     y       y       -    y    y
    D(RI/M)   n   n     n    y   n     n       y       -    y    y

    Modes:
    A - Atomic                 P - PerCPU
    M - Managed                P(RI) - PerCPU with ReInit
    D(RI) - Dead with ReInit   D(RI/M) - Dead with ReInit and Managed

    PerCPU Ref Ops:
    KLL - Kill   REI - Reinit   RES - Resurrect
A percpu reference that has been switched to managed mode cannot be
switched back to any other active mode. Managed ref is reinitialized
to managed mode upon reinit/resurrect.
Approach 2
----------
The second approach gives a managed ref greater runtime mode switching
flexibility. This may be helpful in situations where the object of a
managed ref can enter a shutdown phase. For example, for stale
singular/compound labels, the user can directly call percpu_ref_kill()
on the ref rather than waiting for the manager thread to process it.

The init modes are the same as in the previous approach. Runtime mode
switching adds the ability to convert from managed mode back to
unmanaged mode, thereby enabling transitions to all reinitable modes.
    To -->    A   P   P(RI)  M   D   D(RI)  D(RI/M)  KLL  REI  RES

    A         y   n     y    y   n     y       y       y    y    y
    P         n   n     n    n   y     n       n       y    n    n
    M         y*  n     y*   y   n     y*      y       y*   y    y
    P(RI)     y   n     y    y   n     y       y       y    y    y
    D(RI)     y   n     y    y   n     y       y       -    y    y
    D(RI/M)   y*  n     y*   y   n     y*      y       -    y    y
(RI) refers to modes whose initialization was done with
PERCPU_REF_ALLOW_REINIT. The transitions above are the permitted ones
and may be indirect. For example, a managed ref switches to P(RI) mode
when percpu_ref_switch_to_unmanaged() is invoked for it;
percpu_ref_switch_to_atomic() can then be used to switch from P(RI)
mode to A mode.
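
For example, here is a sketch of the shutdown sequence this approach
permits at a known teardown point (percpu_ref_switch_to_unmanaged() as
added by patch 3; the ref is assumed to be in managed mode):

    /* Take the ref out of managed mode: M -> P(RI). */
    percpu_ref_switch_to_unmanaged(&p->ref);
    /*
     * Drop the initial reference right away instead of waiting for the
     * manager thread's next scan: P(RI) -> D(RI).
     */
    percpu_ref_kill(&p->ref);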
Design Implications
-------------------
1. Deferring the release of a referenced object to the manager thread
   may delay its memory release, which can result in memory pressure.
   With the second approach, this can be partially addressed by
   switching a managed ref to unmanaged mode and then calling
   percpu_ref_kill() on it at known shutdown points in the execution.
   Flushing the scanning work on memory pressure is another strategy
   that can be used.
2. call_rcu_hurry() is used by the percpu refcount lib to perform mode
   switch operations. Back-to-back hurry callbacks can impact energy
   efficiency. The current implementation allows moving the execution
   to housekeeping CPUs by using an unbound workqueue. A deferrable
   timer could be used to avoid these invocations while a CPU is idle,
   by delaying the worker execution. Deferring, though, may delay ref
   reclaims.
3. Since the percpu refcount lib uses a single global switch spinlock,
   back-to-back ref mode switches can delay other percpu ref users.

4. Long-running kworkers may delay other activities, such as system
   suspend. This is mitigated by using a freezable workqueue and
   limiting node scans to a maximum count per run.
5. Because all managed refs undergo the switch-to-atomic operation
   serially, an inactive ref must wait for all prior grace periods to
   complete before it can be assessed. Ref release may be greatly
   delayed as a result. Batching the ref switches, so that a single
   grace period covers all of their RCU callbacks, is one way to deal
   with this.
6. A ref can operate in atomic mode within the window where its
   counter is being checked for zero. This could lead to high memory
   contention for the duration of the RCU grace period (plus callback
   execution). In AppArmor, if the currently scanned label belongs to
   an unconfined profile, all applications that use the unconfined
   profile will execute atomic ref increment and decrement operations
   on that ref during the window. To handle this, a prototype is
   described and implemented in [1] which replaces the atomic and
   percpu counters of the scanned ref with a temporary percpu ref.
   Given that the grace period window is short compared to the scan
   interval, the overall impact might not be significant enough to
   justify the considerable complexity of that prototype
   implementation. This problem requires more investigation to find a
   simpler solution.
Extended/Future Work
--------------------
1. Another design approach that was considered was to define a new
   percpu rcuref type for RCU managed percpu refcounts. This approach
   is prototyped in [1]. Although it provides cleaner semantics w.r.t.
   mode switches and allowed operations, its current implementation,
   which uses composition of percpu ref, could be suboptimal in terms
   of the struct's cacheline footprint and feature extensibility. An
   independent implementation would require refactoring the common
   logic out of the percpu refcount implementation. Additionally, the
   users of the new API could require the modes (e.g. ref kill/reinit)
   supported by percpu refcount; extending percpu rcuref to support
   these could duplicate functionality/semantics between the two
   percpu ref types.

2. Explore hazard pointers for scalable refcounting of objects, which
   would provide a more generic solution with more efficient memory
   space requirements.
Below is the organization of the patches in this series:
1. Implementation of first approach described in "Proposed Solution"
section.
2. Torture test for managed refs to validate early ref release and
   imbalanced refcounts.
   The test was verified on an AMD 4th Generation EPYC processor with
   96C/192T using the following test parameters (these map to the
   module/boot parameters shown after this list):
     nusers = 300
     nrefs = 50
     niterations = 50000
     onoff_holdoff = 5
     onoff_interval = 10
3. Implementation of second approach described in "Proposed Solution"
section.
4. Updates to torture test to test runtime mode switches from managed
to unmanaged modes.
5. Switch Label refcount management to percpu ref in atomic mode.
6. Switch Label refcount management to managed mode.
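
For reference, the test parameters above correspond to the
percpu_refcount_torture parameters documented in patch 2, e.g.:

    percpu_refcount_torture.nusers=300 percpu_refcount_torture.nrefs=50
    percpu_refcount_torture.niterations=50000
    percpu_refcount_torture.onoff_holdoff=5
    percpu_refcount_torture.onoff_interval=10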
Highly appreciate any feedback/suggestions on the design approach.
[1] https://lore.kernel.org/lkml/20240110111856.87370-7-Neeraj.Upadhyay@amd.com/T/
- Neeraj
Neeraj Upadhyay (6):
percpu-refcount: Add managed mode for RCU released objects
percpu-refcount: Add torture test for percpu refcount
percpu-refcount: Extend managed mode to allow runtime switching
percpu-refcount-torture: Extend test with runtime mode switches
apparmor: Switch labels to percpu refcount in atomic mode
apparmor: Switch labels to percpu ref managed mode
.../admin-guide/kernel-parameters.txt | 69 +++
include/linux/percpu-refcount.h | 14 +
lib/Kconfig.debug | 9 +
lib/Makefile | 1 +
lib/percpu-refcount-torture.c | 404 ++++++++++++++++++
lib/percpu-refcount.c | 329 +++++++++++++-
lib/percpu-refcount.h | 6 +
security/apparmor/include/label.h | 16 +-
security/apparmor/include/policy.h | 8 +-
security/apparmor/label.c | 12 +-
security/apparmor/policy_ns.c | 2 +
11 files changed, 836 insertions(+), 34 deletions(-)
create mode 100644 lib/percpu-refcount-torture.c
create mode 100644 lib/percpu-refcount.h
--
2.34.1
* [RFC 1/6] percpu-refcount: Add managed mode for RCU released objects
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
@ 2024-09-16 5:08 ` Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 2/6] percpu-refcount: Add torture test for percpu refcount Neeraj Upadhyay
` (4 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
Add a new "managed mode" to percpu refcounts, to track initial
reference drop for refs which use RCU grace period for their object
reclaims. Typical usage pattern for such refs is:
    // Called with elevated refcount
    get()
            p = get_ptr();
            kref_get(&p->count);
            return p;

    get()
            rcu_read_lock();
            p = get_ptr();
            if (p && !kref_get_unless_zero(&p->count))
                    p = NULL;
            rcu_read_unlock();
            return p;

    release()
            remove_ptr(p);
            call_rcu(&p->rcu, freep);

    release()
            remove_ptr(p);
            kfree_rcu(p, rcu);
Currently, percpu ref requires users to call percpu_ref_kill() when the
object's usage enters a shutdown phase. Post kill, ref
increment/decrement operations are performed on an atomic counter. For
cases where the ref is still actively acquired and released after
percpu_ref_kill(), percpu ref does not provide any performance benefit
over an atomic reference counter. Managed mode offloads tracking of the
ref kill to a manager thread, so users are not required to explicitly
call percpu_ref_kill(). This avoids the suboptimal performance seen
when a percpu ref is actively acquired and released after the
percpu_ref_kill() operation.
A percpu ref can be initialized as managed either during
percpu_ref_init(), by passing the PERCPU_REF_REL_MANAGED flag, or
later, by switching a reinitable ref to managed mode with
percpu_ref_switch_to_managed(). The deferred switch to managed mode can
be used for cases like module initialization errors, where an inited
percpu ref's initial reference is dropped before the object becomes
active and is referenced by other contexts. One such case is AppArmor
labels which are not yet associated with a namespace. These labels are
freed without waiting for an RCU grace period, so managed mode cannot
be used for them until their initialization has completed.
Following are the allowed initialization modes for managed ref:
                  Atomic   Percpu   Dead   Reinit   Managed
    Managed-ref      Y        N       Y       Y        Y
Following are the allowed transitions for managed ref:
    To -->    A   P   P(RI)  M   D   D(RI)  D(RI/M)  KLL  REI  RES

    A         y   n     y    y   n     y       y       y    y    y
    P         n   n     n    n   y     n       n       y    n    n
    M         n   n     n    y   n     n       y       n    y    y
    P(RI)     y   n     y    y   n     y       y       y    y    y
    D(RI)     y   n     y    y   n     y       y       -    y    y
    D(RI/M)   n   n     n    y   n     n       y       -    y    y

    Modes:
    A - Atomic                 P - PerCPU
    M - Managed                P(RI) - PerCPU with ReInit
    D(RI) - Dead with ReInit   D(RI/M) - Dead with ReInit and Managed

    PerCPU Ref Ops:
    KLL - Kill   REI - Reinit   RES - Resurrect
Once a percpu ref is switched to managed mode, it cannot be switched to
any other active mode. On reinit/resurrect, managed ref is reinitialized
in managed mode.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
.../admin-guide/kernel-parameters.txt | 12 +
include/linux/percpu-refcount.h | 13 +
lib/percpu-refcount.c | 358 +++++++++++++++++-
3 files changed, 364 insertions(+), 19 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 09126bb8cc9f..0f02a1b04fe9 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4665,6 +4665,18 @@
allocator. This parameter is primarily for debugging
and performance comparison.
+ percpu_refcount.max_scan_count= [KNL]
+ Specifies the maximum number of percpu ref nodes which
+ are processed in one run of percpu ref manager thread.
+
+ Default: 100
+
+ percpu_refcount.scan_interval= [KNL]
+ Specifies the duration (ms) between two runs of manager
+ thread.
+
+ Default: 5000 ms
+
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/arch/x86/i386/IO-APIC.rst.
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d73a1c08c3e3..e6aea81b3d01 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,6 +68,11 @@ enum {
__PERCPU_REF_FLAG_BITS = 2,
};
+/* Auxiliary flags */
+enum {
+ __PERCPU_REL_MANAGED = 1LU << 0, /* operating in managed mode */
+};
+
/* @flags for percpu_ref_init() */
enum {
/*
@@ -90,6 +95,10 @@ enum {
* Allow switching from atomic mode to percpu mode.
*/
PERCPU_REF_ALLOW_REINIT = 1 << 2,
+ /*
+ * Manage release of the percpu ref.
+ */
+ PERCPU_REF_REL_MANAGED = 1 << 3,
};
struct percpu_ref_data {
@@ -100,6 +109,9 @@ struct percpu_ref_data {
bool allow_reinit:1;
struct rcu_head rcu;
struct percpu_ref *ref;
+ unsigned int aux_flags;
+ struct llist_node node;
+
};
struct percpu_ref {
@@ -126,6 +138,7 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
percpu_ref_func_t *confirm_switch);
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
+int percpu_ref_switch_to_managed(struct percpu_ref *ref);
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill);
void percpu_ref_resurrect(struct percpu_ref *ref);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 668f6aa6a75d..7b97f9728c5b 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -5,6 +5,9 @@
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/slab.h>
+#include <linux/llist.h>
+#include <linux/moduleparam.h>
+#include <linux/types.h>
#include <linux/mm.h>
#include <linux/percpu-refcount.h>
@@ -38,6 +41,7 @@
static DEFINE_SPINLOCK(percpu_ref_switch_lock);
static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
+static LLIST_HEAD(percpu_ref_manage_head);
static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
{
@@ -45,6 +49,8 @@ static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
}
+int percpu_ref_switch_to_managed(struct percpu_ref *ref);
+
/**
* percpu_ref_init - initialize a percpu refcount
* @ref: percpu_ref to initialize
@@ -80,6 +86,9 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
return -ENOMEM;
}
+ if (flags & PERCPU_REF_REL_MANAGED)
+ flags |= PERCPU_REF_ALLOW_REINIT;
+
data->force_atomic = flags & PERCPU_REF_INIT_ATOMIC;
data->allow_reinit = flags & PERCPU_REF_ALLOW_REINIT;
@@ -101,10 +110,73 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
data->confirm_switch = NULL;
data->ref = ref;
ref->data = data;
+ init_llist_node(&data->node);
+
+ if (flags & PERCPU_REF_REL_MANAGED)
+ percpu_ref_switch_to_managed(ref);
+
return 0;
}
EXPORT_SYMBOL_GPL(percpu_ref_init);
+static bool percpu_ref_is_managed(struct percpu_ref *ref)
+{
+ return (ref->data->aux_flags & __PERCPU_REL_MANAGED) != 0;
+}
+
+static void __percpu_ref_switch_mode(struct percpu_ref *ref,
+ percpu_ref_func_t *confirm_switch);
+
+static int __percpu_ref_switch_to_managed(struct percpu_ref *ref)
+{
+ unsigned long __percpu *percpu_count;
+ struct percpu_ref_data *data;
+ int ret = -1;
+
+ data = ref->data;
+
+ if (WARN_ONCE(!percpu_ref_tryget(ref), "Percpu ref is not active"))
+ return ret;
+
+ if (WARN_ONCE(!data->allow_reinit, "Percpu ref does not allow switch"))
+ goto err_switch_managed;
+
+ if (WARN_ONCE(percpu_ref_is_managed(ref), "Percpu ref is already managed"))
+ goto err_switch_managed;
+
+ data->aux_flags |= __PERCPU_REL_MANAGED;
+ data->force_atomic = false;
+ if (!__ref_is_percpu(ref, &percpu_count))
+ __percpu_ref_switch_mode(ref, NULL);
+ /* Ensure ordering of percpu mode switch and node scan */
+ smp_mb();
+ llist_add(&data->node, &percpu_ref_manage_head);
+
+ return 0;
+
+err_switch_managed:
+ percpu_ref_put(ref);
+ return ret;
+}
+
+/**
+ * percpu_ref_switch_to_managed - Switch an unmanaged ref to managed mode.
+ *
+ * @ref: percpu_ref to switch to managed mode
+ *
+ */
+int percpu_ref_switch_to_managed(struct percpu_ref *ref)
+{
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+ ret = __percpu_ref_switch_to_managed(ref);
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_managed);
+
static void __percpu_ref_exit(struct percpu_ref *ref)
{
unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
@@ -283,6 +355,27 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
__percpu_ref_switch_to_percpu(ref);
}
+static bool __percpu_ref_switch_to_atomic_checked(struct percpu_ref *ref,
+ percpu_ref_func_t *confirm_switch,
+ bool check_managed)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+ if (check_managed && WARN_ONCE(percpu_ref_is_managed(ref),
+ "Percpu ref is managed, cannot switch to atomic mode")) {
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ return false;
+ }
+
+ ref->data->force_atomic = true;
+ __percpu_ref_switch_mode(ref, confirm_switch);
+
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+
+ return true;
+}
+
/**
* percpu_ref_switch_to_atomic - switch a percpu_ref to atomic mode
* @ref: percpu_ref to switch to atomic mode
@@ -306,17 +399,16 @@ static void __percpu_ref_switch_mode(struct percpu_ref *ref,
void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
percpu_ref_func_t *confirm_switch)
{
- unsigned long flags;
-
- spin_lock_irqsave(&percpu_ref_switch_lock, flags);
-
- ref->data->force_atomic = true;
- __percpu_ref_switch_mode(ref, confirm_switch);
-
- spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ (void)__percpu_ref_switch_to_atomic_checked(ref, confirm_switch, true);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
+static void __percpu_ref_switch_to_atomic_sync_checked(struct percpu_ref *ref, bool check_managed)
+{
+ if (!__percpu_ref_switch_to_atomic_checked(ref, NULL, check_managed))
+ return;
+ wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+}
/**
* percpu_ref_switch_to_atomic_sync - switch a percpu_ref to atomic mode
* @ref: percpu_ref to switch to atomic mode
@@ -327,11 +419,28 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic);
*/
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref)
{
- percpu_ref_switch_to_atomic(ref, NULL);
- wait_event(percpu_ref_switch_waitq, !ref->data->confirm_switch);
+ __percpu_ref_switch_to_atomic_sync_checked(ref, true);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
+static void __percpu_ref_switch_to_percpu_checked(struct percpu_ref *ref, bool check_managed)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+
+ if (check_managed && WARN_ONCE(percpu_ref_is_managed(ref),
+ "Percpu ref is managed, cannot switch to percpu mode")) {
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ return;
+ }
+
+ ref->data->force_atomic = false;
+ __percpu_ref_switch_mode(ref, NULL);
+
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+}
+
/**
* percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
* @ref: percpu_ref to switch to percpu mode
@@ -352,14 +461,7 @@ EXPORT_SYMBOL_GPL(percpu_ref_switch_to_atomic_sync);
*/
void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
{
- unsigned long flags;
-
- spin_lock_irqsave(&percpu_ref_switch_lock, flags);
-
- ref->data->force_atomic = false;
- __percpu_ref_switch_mode(ref, NULL);
-
- spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+ __percpu_ref_switch_to_percpu_checked(ref, true);
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_percpu);
@@ -472,8 +574,226 @@ void percpu_ref_resurrect(struct percpu_ref *ref)
ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
percpu_ref_get(ref);
- __percpu_ref_switch_mode(ref, NULL);
+ if (percpu_ref_is_managed(ref)) {
+ ref->data->aux_flags &= ~__PERCPU_REL_MANAGED;
+ __percpu_ref_switch_to_managed(ref);
+ } else {
+ __percpu_ref_switch_mode(ref, NULL);
+ }
spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
}
EXPORT_SYMBOL_GPL(percpu_ref_resurrect);
+
+#define DEFAULT_SCAN_INTERVAL_MS 5000
+/* Interval duration between two ref scans. */
+static ulong scan_interval = DEFAULT_SCAN_INTERVAL_MS;
+module_param(scan_interval, ulong, 0444);
+
+#define DEFAULT_MAX_SCAN_COUNT 100
+/* Number of percpu refs scanned in one iteration of worker execution. */
+static int max_scan_count = DEFAULT_MAX_SCAN_COUNT;
+module_param(max_scan_count, int, 0444);
+
+static void percpu_ref_release_work_fn(struct work_struct *work);
+
+/*
+ * Sentinel llist nodes for lockless list traversal and deletions by
+ * the pcpu ref release worker, while nodes are added from
+ * percpu_ref_init() and percpu_ref_switch_to_managed().
+ *
+ * Sentinel node marks the head of list traversal for the current
+ * iteration of kworker execution.
+ */
+struct percpu_ref_sen_node {
+ bool inuse;
+ struct llist_node node;
+};
+
+/*
+ * We need two sentinel nodes for lockless list manipulations from release
+ * worker - first node will be used in current reclaim iteration. The second
+ * node will be used in next iteration. Next iteration marks the first node
+ * as free, for use in subsequent iteration.
+ */
+#define PERCPU_REF_SEN_NODES_COUNT 2
+
+/* Track last processed percpu ref node */
+static struct llist_node *last_percpu_ref_node;
+
+static struct percpu_ref_sen_node
+ percpu_ref_sen_nodes[PERCPU_REF_SEN_NODES_COUNT];
+
+static DECLARE_DELAYED_WORK(percpu_ref_release_work, percpu_ref_release_work_fn);
+
+static bool percpu_ref_is_sen_node(struct llist_node *node)
+{
+ return &percpu_ref_sen_nodes[0].node <= node &&
+ node <= &percpu_ref_sen_nodes[PERCPU_REF_SEN_NODES_COUNT - 1].node;
+}
+
+static struct llist_node *percpu_ref_get_sen_node(void)
+{
+ int i;
+ struct percpu_ref_sen_node *sn;
+
+ for (i = 0; i < PERCPU_REF_SEN_NODES_COUNT; i++) {
+ sn = &percpu_ref_sen_nodes[i];
+ if (!sn->inuse) {
+ sn->inuse = true;
+ return &sn->node;
+ }
+ }
+
+ return NULL;
+}
+
+static void percpu_ref_put_sen_node(struct llist_node *node)
+{
+ struct percpu_ref_sen_node *sn = container_of(node, struct percpu_ref_sen_node, node);
+
+ sn->inuse = false;
+ init_llist_node(node);
+}
+
+static void percpu_ref_put_all_sen_nodes_except(struct llist_node *node)
+{
+ int i;
+
+ for (i = 0; i < PERCPU_REF_SEN_NODES_COUNT; i++) {
+ if (&percpu_ref_sen_nodes[i].node == node)
+ continue;
+ percpu_ref_sen_nodes[i].inuse = false;
+ init_llist_node(&percpu_ref_sen_nodes[i].node);
+ }
+}
+
+static struct workqueue_struct *percpu_ref_release_wq;
+
+static void percpu_ref_release_work_fn(struct work_struct *work)
+{
+ struct llist_node *pos, *first, *head, *prev, *next;
+ struct llist_node *sen_node;
+ struct percpu_ref *ref;
+ int count = 0;
+ bool held;
+
+ first = READ_ONCE(percpu_ref_manage_head.first);
+ if (!first)
+ goto queue_release_work;
+
+ /*
+ * Enqueue a dummy node to mark the start of scan. This dummy
+ * node is used as start point of scan and ensures that
+ * there is no additional synchronization required with new
+ * label node additions to the llist. Any new labels will
+ * be processed in next run of the kworker.
+ *
+ * SCAN START PTR
+ * |
+ * v
+ * +----------+ +------+ +------+ +------+
+ * | | | | | | | |
+ * | head ------> dummy|--->|label |--->| label|--->NULL
+ * | | | node | | | | |
+ * +----------+ +------+ +------+ +------+
+ *
+ *
+ * New label addition:
+ *
+ * SCAN START PTR
+ * |
+ * v
+ * +----------+ +------+ +------+ +------+ +------+
+ * | | | | | | | | | |
+ * | head |--> label|--> dummy|--->|label |--->| label|--->NULL
+ * | | | | | node | | | | |
+ * +----------+ +------+ +------+ +------+ +------+
+ *
+ */
+ if (last_percpu_ref_node == NULL || last_percpu_ref_node->next == NULL) {
+retry_sentinel_get:
+ sen_node = percpu_ref_get_sen_node();
+ /*
+ * All sentinel nodes are in use? This should not happen, as we
+ * require only one sentinel for the start of list traversal and
+ * other sentinel node is freed during the traversal.
+ */
+ if (WARN_ONCE(!sen_node, "All sentinel nodes are in use")) {
+ /* Use first node as the sentinel node */
+ head = first->next;
+ if (!head) {
+ struct llist_node *ign_node = NULL;
+ /*
+ * We exhausted sentinel nodes. However, there aren't
+ * enough nodes in the llist. So, we have leaked
+ * sentinel nodes. Reclaim sentinels and retry.
+ */
+ if (percpu_ref_is_sen_node(first))
+ ign_node = first;
+ percpu_ref_put_all_sen_nodes_except(ign_node);
+ goto retry_sentinel_get;
+ }
+ prev = first;
+ } else {
+ llist_add(sen_node, &percpu_ref_manage_head);
+ prev = sen_node;
+ head = prev->next;
+ }
+ } else {
+ prev = last_percpu_ref_node;
+ head = prev->next;
+ }
+
+ last_percpu_ref_node = NULL;
+ llist_for_each_safe(pos, next, head) {
+ /* Free sentinel node which is present in the list */
+ if (percpu_ref_is_sen_node(pos)) {
+ prev->next = pos->next;
+ percpu_ref_put_sen_node(pos);
+ continue;
+ }
+
+ ref = container_of(pos, struct percpu_ref_data, node)->ref;
+ __percpu_ref_switch_to_atomic_sync_checked(ref, false);
+ /*
+ * Drop the ref while in RCU read critical section to
+ * prevent the obj from being freed while we manipulate the node.
+ */
+ rcu_read_lock();
+ percpu_ref_put(ref);
+ held = percpu_ref_tryget(ref);
+ if (!held) {
+ prev->next = pos->next;
+ init_llist_node(pos);
+ ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
+ }
+ rcu_read_unlock();
+ if (!held)
+ continue;
+ __percpu_ref_switch_to_percpu_checked(ref, false);
+ count++;
+ if (count == max_scan_count) {
+ last_percpu_ref_node = pos;
+ break;
+ }
+ prev = pos;
+ }
+
+queue_release_work:
+ queue_delayed_work(percpu_ref_release_wq, &percpu_ref_release_work,
+ scan_interval);
+}
+
+static __init int percpu_ref_setup(void)
+{
+ percpu_ref_release_wq = alloc_workqueue("percpu_ref_release_wq",
+ WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_FREEZABLE, 0);
+ if (!percpu_ref_release_wq)
+ return -ENOMEM;
+
+ queue_delayed_work(percpu_ref_release_wq, &percpu_ref_release_work,
+ scan_interval);
+ return 0;
+}
+early_initcall(percpu_ref_setup);
--
2.34.1
* [RFC 2/6] percpu-refcount: Add torture test for percpu refcount
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 1/6] percpu-refcount: Add managed mode for RCU released objects Neeraj Upadhyay
@ 2024-09-16 5:08 ` Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 3/6] percpu-refcount: Extend managed mode to allow runtime switching Neeraj Upadhyay
` (3 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
Add a torture test for percpu managed mode operations. It verifies
that a percpu ref has a zero count once all users have dropped their
references, and that the ref is not released early while users still
hold references to it.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
.../admin-guide/kernel-parameters.txt | 57 +++
lib/Kconfig.debug | 9 +
lib/Makefile | 1 +
lib/percpu-refcount-torture.c | 367 ++++++++++++++++++
lib/percpu-refcount.c | 49 ++-
lib/percpu-refcount.h | 6 +
6 files changed, 483 insertions(+), 6 deletions(-)
create mode 100644 lib/percpu-refcount-torture.c
create mode 100644 lib/percpu-refcount.h
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 0f02a1b04fe9..225f2dac294d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4677,6 +4677,63 @@
Default: 5000 ms
+ percpu_refcount_torture.busted_early_ref_release= [KNL]
+ Enable testing buggy release of percpu ref while
+ there are active users. Used for testing failure
+ scenarios in the test.
+
+ Default: 0 (disabled)
+
+ percpu_refcount_torture.busted_late_ref_release= [KNL]
+ Enable testing buggy non-zero reference count after
+ all ref users have dropped their reference. Used for
+ testing failure scenarios in the test.
+
+ Default: 0 (disabled)
+
+ percpu_refcount_torture.delay_us= [KNL]
+ Delay (in us) between reference increment and decrement
+ operations of ref users.
+
+ Default: 10
+
+ percpu_refcount_torture.niterations= [KNL]
+ Number of iterations of ref increment and decrement by
+ ref users.
+
+ Default: 100
+
+ percpu_refcount_torture.nrefs= [KNL]
+ Number of percpu ref instances.
+
+ Default: 2
+
+ percpu_refcount_torture.nusers= [KNL]
+ Number of percpu ref user threads which increment and
+ decrement a percpu ref.
+
+ Default: 2
+
+ percpu_refcount_torture.onoff_holdoff= [KNL]
+ Set time (s) after boot for CPU-hotplug testing.
+
+ Default: 0
+
+ percpu_refcount_torture.onoff_interval= [KNL]
+ Set time (jiffies) between CPU-hotplug operations,
+ or zero to disable CPU-hotplug testing.
+
+ percpu_refcount_torture.stutter= [KNL]
+ Set wait time (jiffies) between two iterations of
+ percpu ref operations.
+
+ Default: 0
+
+ percpu_refcount_torture.verbose= [KNL]
+ Enable additional printk() statements.
+
+ Default: 0 (Disabled)
+
pirq= [SMP,APIC] Manual mp-table setup
See Documentation/arch/x86/i386/IO-APIC.rst.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index a30c03a66172..7e0117e01f05 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1611,6 +1611,15 @@ config SCF_TORTURE_TEST
module may be built after the fact on the running kernel to
be tested, if desired.
+config PERCPU_REFCOUNT_TORTURE_TEST
+ tristate "torture tests for percpu refcount"
+ select TORTURE_TEST
+ help
+	  This option provides a kernel module that runs percpu
+ refcount torture tests for managed percpu refs. The kernel
+ module may be built after the fact on the running kernel
+ to be tested, if desired.
+
config CSD_LOCK_WAIT_DEBUG
bool "Debugging for csd_lock_wait(), called from smp_call_function*()"
depends on DEBUG_KERNEL
diff --git a/lib/Makefile b/lib/Makefile
index 322bb127b4dc..d0286f7dfb37 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -50,6 +50,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
generic-radix-tree.o bitmap-str.o
obj-$(CONFIG_STRING_KUNIT_TEST) += string_kunit.o
+obj-$(CONFIG_PERCPU_REFCOUNT_TORTURE_TEST) += percpu-refcount-torture.o
obj-y += string_helpers.o
obj-$(CONFIG_STRING_HELPERS_KUNIT_TEST) += string_helpers_kunit.o
obj-y += hexdump.o
diff --git a/lib/percpu-refcount-torture.c b/lib/percpu-refcount-torture.c
new file mode 100644
index 000000000000..686f5a228b40
--- /dev/null
+++ b/lib/percpu-refcount-torture.c
@@ -0,0 +1,367 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu-refcount.h>
+#include <linux/torture.h>
+
+#include "percpu-refcount.h"
+
+static int busted_early_ref_release;
+module_param(busted_early_ref_release, int, 0444);
+MODULE_PARM_DESC(busted_early_ref_release,
+ "Enable busted premature release of ref (default = 0), 0 = disable");
+
+static int busted_late_ref_release;
+module_param(busted_late_ref_release, int, 0444);
+MODULE_PARM_DESC(busted_late_ref_release,
+ "Enable busted late release of ref (default = 0), 0 = disable");
+
+static long delay_us = 10;
+module_param(delay_us, long, 0444);
+MODULE_PARM_DESC(delay_us,
+ "delay between reader refcount operations in microseconds (default = 10)");
+
+static long nrefs = 2;
+module_param(nrefs, long, 0444);
+MODULE_PARM_DESC(nrefs, "Number of percpu refs (default = 2)");
+
+static long niterations = 100;
+module_param(niterations, long, 0444);
+MODULE_PARM_DESC(niterations,
+ "Number of iterations of ref increment and decrement (default = 100)");
+
+static long nusers = 2;
+module_param(nusers, long, 0444);
+MODULE_PARM_DESC(nusers, "Number of refcount users (default = 2)");
+
+static int onoff_holdoff;
+module_param(onoff_holdoff, int, 0444);
+MODULE_PARM_DESC(onoff_holdoff, "Time after boot before CPU hotplugs (seconds)");
+
+static int onoff_interval;
+module_param(onoff_interval, int, 0444);
+MODULE_PARM_DESC(onoff_interval, "Time between CPU hotplugs (jiffies), 0=disable");
+
+static int stutter;
+module_param(stutter, int, 0444);
+MODULE_PARM_DESC(stutter, "Stutter period in jiffies (default = 0), 0 = disable");
+
+static int verbose = 1;
+module_param(verbose, int, 0444);
+MODULE_PARM_DESC(verbose, "Enable verbose debugging printk()s");
+
+static struct task_struct **ref_user_tasks;
+static struct task_struct *ref_manager_task;
+static struct task_struct **busted_early_release_tasks;
+static struct task_struct **busted_late_release_tasks;
+
+static struct percpu_ref *refs;
+static long *num_per_ref_users;
+
+static atomic_t running;
+static atomic_t *ref_running;
+
+static char *torture_type = "percpu-refcount";
+
+static int percpu_ref_manager_thread(void *data)
+{
+ int i;
+
+ while (atomic_read(&running) != 0) {
+ percpu_ref_test_flush_release_work();
+ stutter_wait("percpu_ref_manager_thread");
+ }
+ /* Ensure ordering with ref users */
+ smp_mb();
+
+ percpu_ref_test_flush_release_work();
+
+ for (i = 0; i < nrefs; i++) {
+ WARN(percpu_ref_test_is_percpu(&refs[i]),
+ "!!! released ref %d should be in atomic mode", i);
+ WARN(!percpu_ref_is_zero(&refs[i]),
+ "!!! released ref %d should have 0 refcount", i);
+ }
+
+ do {
+ stutter_wait("percpu_ref_manager_thread");
+ } while (!torture_must_stop());
+
+ torture_kthread_stopping("percpu_ref_manager_thread");
+
+ return 0;
+}
+
+static int percpu_ref_test_thread(void *data)
+{
+ struct percpu_ref *ref = (struct percpu_ref *)data;
+ int i = 0;
+
+ percpu_ref_get(ref);
+
+ do {
+ percpu_ref_get(ref);
+ udelay(delay_us);
+ percpu_ref_put(ref);
+ stutter_wait("percpu_ref_test_thread");
+ i++;
+ } while (i < niterations);
+
+ atomic_dec(&ref_running[ref - refs]);
+ /* Order ref release with ref_running[ref_idx] == 0 */
+ smp_mb();
+ percpu_ref_put(ref);
+ /* Order ref decrement with running == 0 */
+ smp_mb();
+ atomic_dec(&running);
+
+ do {
+ stutter_wait("percpu_ref_test_thread");
+ } while (!torture_must_stop());
+
+ torture_kthread_stopping("percpu_ref_test_thread");
+
+ return 0;
+}
+
+static int percpu_ref_busted_early_thread(void *data)
+{
+ struct percpu_ref *ref = (struct percpu_ref *)data;
+ int ref_idx = ref - refs;
+ int i = 0, j;
+
+ do {
+		/* Extra ref put momentarily */
+ for (j = 0; j < num_per_ref_users[ref_idx]; j++)
+ percpu_ref_put(ref);
+ stutter_wait("percpu_ref_busted_early_thread");
+ for (j = 0; j < num_per_ref_users[ref_idx]; j++)
+ percpu_ref_get(ref);
+ i++;
+ stutter_wait("percpu_ref_busted_early_thread");
+ } while (i < niterations * 10);
+
+ do {
+ stutter_wait("percpu_ref_busted_early_thread");
+ } while (!torture_must_stop());
+
+ torture_kthread_stopping("percpu_ref_busted_early_thread");
+
+ return 0;
+}
+
+static int percpu_ref_busted_late_thread(void *data)
+{
+ struct percpu_ref *ref = (struct percpu_ref *)data;
+ int i = 0;
+
+ do {
+		/* Extra ref get momentarily */
+ percpu_ref_get(ref);
+ stutter_wait("percpu_ref_busted_late_thread");
+ percpu_ref_put(ref);
+ i++;
+ } while (i < niterations);
+
+ do {
+ stutter_wait("percpu_ref_busted_late_thread");
+ } while (!torture_must_stop());
+
+ torture_kthread_stopping("percpu_ref_busted_late_thread");
+
+ return 0;
+}
+
+static void percpu_ref_test_cleanup(void)
+{
+ int i;
+
+ if (torture_cleanup_begin())
+ return;
+
+ if (busted_late_release_tasks) {
+ for (i = 0; i < nrefs; i++)
+ torture_stop_kthread(busted_late_task, busted_late_release_tasks[i]);
+ kfree(busted_late_release_tasks);
+ busted_late_release_tasks = NULL;
+ }
+
+ if (busted_early_release_tasks) {
+ for (i = 0; i < nrefs; i++)
+ torture_stop_kthread(busted_early_task, busted_early_release_tasks[i]);
+ kfree(busted_early_release_tasks);
+ busted_early_release_tasks = NULL;
+ }
+
+ if (ref_manager_task) {
+ torture_stop_kthread(ref_manager, ref_manager_task);
+ ref_manager_task = NULL;
+ }
+
+ if (ref_user_tasks) {
+ for (i = 0; i < nusers; i++)
+ torture_stop_kthread(ref_user, ref_user_tasks[i]);
+ kfree(ref_user_tasks);
+ ref_user_tasks = NULL;
+ }
+
+ kfree(ref_running);
+ ref_running = NULL;
+
+ kfree(num_per_ref_users);
+ num_per_ref_users = NULL;
+
+ if (refs) {
+ for (i = 0; i < nrefs; i++)
+ percpu_ref_exit(&refs[i]);
+ kfree(refs);
+ refs = NULL;
+ }
+
+ torture_cleanup_end();
+}
+
+static void percpu_ref_test_release(struct percpu_ref *ref)
+{
+ WARN(!!atomic_add_return(0, &ref_running[ref-refs]), "!!! Premature ref release");
+}
+
+static int __init percpu_ref_torture_init(void)
+{
+ DEFINE_TORTURE_RANDOM(rand);
+ struct torture_random_state *trsp = &rand;
+ int flags;
+ int err;
+ int ref_idx;
+ int i;
+
+ if (!torture_init_begin("percpu-refcount", verbose))
+ return -EBUSY;
+
+ atomic_set(&running, nusers);
+ /* Order @running with later increment and decrement operations */
+ smp_mb();
+
+ refs = kcalloc(nrefs, sizeof(refs[0]), GFP_KERNEL);
+ if (!refs) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+ for (i = 0; i < nrefs; i++) {
+ flags = torture_random(trsp) & 1 ? PERCPU_REF_INIT_ATOMIC : PERCPU_REF_REL_MANAGED;
+ err = percpu_ref_init(&refs[i], percpu_ref_test_release,
+ flags, GFP_KERNEL);
+ if (err)
+ goto init_err;
+ if (!(flags & PERCPU_REF_REL_MANAGED))
+ percpu_ref_switch_to_managed(&refs[i]);
+ }
+
+ num_per_ref_users = kcalloc(nrefs, sizeof(num_per_ref_users[0]), GFP_KERNEL);
+ if (!num_per_ref_users) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+ for (i = 0; i < nrefs; i++)
+ num_per_ref_users[i] = 0;
+
+ ref_user_tasks = kcalloc(nusers, sizeof(ref_user_tasks[0]), GFP_KERNEL);
+ if (!ref_user_tasks) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+
+ ref_running = kcalloc(nrefs, sizeof(ref_running[0]), GFP_KERNEL);
+ if (!ref_running) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+
+ for (i = 0; i < nusers; i++) {
+ ref_idx = torture_random(trsp) % nrefs;
+ atomic_inc(&ref_running[ref_idx]);
+ num_per_ref_users[ref_idx]++;
+		/* Order increments with subsequent reads */
+ smp_mb();
+ err = torture_create_kthread(percpu_ref_test_thread,
+ &refs[ref_idx], ref_user_tasks[i]);
+ if (torture_init_error(err))
+ goto init_err;
+ }
+
+ err = torture_create_kthread(percpu_ref_manager_thread, NULL, ref_manager_task);
+ if (torture_init_error(err))
+ goto init_err;
+
+ /* Drop initial reference, after test threads have started running */
+ udelay(1);
+ for (i = 0; i < nrefs; i++)
+ percpu_ref_put(&refs[i]);
+
+
+ if (busted_early_ref_release) {
+ busted_early_release_tasks = kcalloc(nrefs,
+ sizeof(busted_early_release_tasks[0]),
+ GFP_KERNEL);
+ if (!busted_early_release_tasks) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+ for (i = 0; i < nrefs; i++) {
+ err = torture_create_kthread(percpu_ref_busted_early_thread,
+ &refs[i], busted_early_release_tasks[i]);
+ if (torture_init_error(err))
+ goto init_err;
+ }
+ }
+
+ if (busted_late_ref_release) {
+ busted_late_release_tasks = kcalloc(nrefs, sizeof(busted_late_release_tasks[0]),
+ GFP_KERNEL);
+ if (!busted_late_release_tasks) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+ for (i = 0; i < nrefs; i++) {
+ err = torture_create_kthread(percpu_ref_busted_late_thread,
+ &refs[i], busted_late_release_tasks[i]);
+ if (torture_init_error(err))
+ goto init_err;
+ }
+ }
+ if (stutter) {
+ err = torture_stutter_init(stutter, stutter);
+ if (torture_init_error(err))
+ goto init_err;
+ }
+
+ err = torture_onoff_init(onoff_holdoff * HZ, onoff_interval, NULL);
+ if (torture_init_error(err))
+ goto init_err;
+
+ torture_init_end();
+ return 0;
+init_err:
+ torture_init_end();
+ percpu_ref_test_cleanup();
+ return err;
+}
+
+static void __exit percpu_ref_torture_exit(void)
+{
+ percpu_ref_test_cleanup();
+}
+
+module_init(percpu_ref_torture_init);
+module_exit(percpu_ref_torture_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("percpu refcount torture test");
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 7b97f9728c5b..7d0c85c7ce57 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -11,6 +11,8 @@
#include <linux/mm.h>
#include <linux/percpu-refcount.h>
+#include "percpu-refcount.h"
+
/*
* Initially, a percpu refcount is just a set of percpu counters. Initially, we
* don't try to detect the ref hitting 0 - which means that get/put can just
@@ -677,6 +679,7 @@ static void percpu_ref_release_work_fn(struct work_struct *work)
struct percpu_ref *ref;
int count = 0;
bool held;
+ struct llist_node *last_node = READ_ONCE(last_percpu_ref_node);
first = READ_ONCE(percpu_ref_manage_head.first);
if (!first)
@@ -711,7 +714,7 @@ static void percpu_ref_release_work_fn(struct work_struct *work)
* +----------+ +------+ +------+ +------+ +------+
*
*/
- if (last_percpu_ref_node == NULL || last_percpu_ref_node->next == NULL) {
+ if (last_node == NULL || last_node->next == NULL) {
retry_sentinel_get:
sen_node = percpu_ref_get_sen_node();
/*
@@ -741,11 +744,10 @@ static void percpu_ref_release_work_fn(struct work_struct *work)
head = prev->next;
}
} else {
- prev = last_percpu_ref_node;
+ prev = last_node;
head = prev->next;
}
- last_percpu_ref_node = NULL;
llist_for_each_safe(pos, next, head) {
/* Free sentinel node which is present in the list */
if (percpu_ref_is_sen_node(pos)) {
@@ -773,18 +775,53 @@ static void percpu_ref_release_work_fn(struct work_struct *work)
continue;
__percpu_ref_switch_to_percpu_checked(ref, false);
count++;
- if (count == max_scan_count) {
- last_percpu_ref_node = pos;
- break;
+ if (count == READ_ONCE(max_scan_count)) {
+ WRITE_ONCE(last_percpu_ref_node, pos);
+ goto queue_release_work;
}
prev = pos;
}
+ WRITE_ONCE(last_percpu_ref_node, NULL);
queue_release_work:
queue_delayed_work(percpu_ref_release_wq, &percpu_ref_release_work,
scan_interval);
}
+bool percpu_ref_test_is_percpu(struct percpu_ref *ref)
+{
+ unsigned long __percpu *percpu_count;
+
+ return __ref_is_percpu(ref, &percpu_count);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_test_is_percpu);
+
+void percpu_ref_test_flush_release_work(void)
+{
+ int max_flush = READ_ONCE(max_scan_count);
+ int max_count = 1000;
+
+ /* Complete any executing release work */
+ flush_delayed_work(&percpu_ref_release_work);
+ /* Scan till the end of the llist */
+ WRITE_ONCE(max_scan_count, INT_MAX);
+ /* max scan count update visible to release work */
+ smp_mb();
+ flush_delayed_work(&percpu_ref_release_work);
+ /* max scan count update visible to release work */
+ smp_mb();
+ WRITE_ONCE(max_scan_count, 1);
+ /* max scan count update visible to work */
+ smp_mb();
+ flush_delayed_work(&percpu_ref_release_work);
+ while (READ_ONCE(last_percpu_ref_node) != NULL && max_count--)
+ flush_delayed_work(&percpu_ref_release_work);
+ /* max scan count update visible to work */
+ smp_mb();
+ WRITE_ONCE(max_scan_count, max_flush);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_test_flush_release_work);
+
static __init int percpu_ref_setup(void)
{
percpu_ref_release_wq = alloc_workqueue("percpu_ref_release_wq",
diff --git a/lib/percpu-refcount.h b/lib/percpu-refcount.h
new file mode 100644
index 000000000000..be2ac0411194
--- /dev/null
+++ b/lib/percpu-refcount.h
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+#ifndef __LIB_REFCOUNT_H
+#define __LIB_REFCOUNT_H
+bool percpu_ref_test_is_percpu(struct percpu_ref *ref);
+void percpu_ref_test_flush_release_work(void);
+#endif
--
2.34.1
* [RFC 3/6] percpu-refcount: Extend managed mode to allow runtime switching
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 1/6] percpu-refcount: Add managed mode for RCU released objects Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 2/6] percpu-refcount: Add torture test for percpu refcount Neeraj Upadhyay
@ 2024-09-16 5:08 ` Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 4/6] percpu-refcount-torture: Extend test with runtime mode switches Neeraj Upadhyay
` (2 subsequent siblings)
5 siblings, 0 replies; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
Provide more flexibility for runtime mode switching of a managed
percpu ref. This is useful when a managed ref's object enters its
shutdown phase: instead of waiting for the manager thread to process
the ref, the user can directly invoke percpu_ref_kill() for the ref.
The init modes are the same as in the existing code. Runtime mode
switching allows a managed ref to be switched back to unmanaged mode,
which in turn allows transitions from managed mode to all reinitable
modes.
 To -->    A    P    P(RI)  M    D    D(RI)  D(RI/M)  EX   REI  RES
 A         y    n    y      y    n    y      y        y    y    y
 P         n    n    n      n    y    n      n        y    n    n
 M         y*   n    y*     y    n    y*     y        y*   y    y
 P(RI)     y    n    y      y    n    y      y        y    y    y
 D(RI)     y    n    y      y    n    y      y        -    y    y
 D(RI/M)   y*   n    y*     y    n    y*     y        -    y    y
Modes:
  A       - Atomic
  P       - PerCPU
  P(RI)   - PerCPU with ReInit
  M       - Managed
  D       - Dead
  D(RI)   - Dead with ReInit
  D(RI/M) - Dead with ReInit and Managed
PerCPU Ref Ops:
  KLL - Kill    REI - Reinit    RES - Resurrect
(RI) marks modes which are initialized with PERCPU_REF_ALLOW_REINIT.
The table shows the allowed transitions; a transition may be indirect.
For example, a managed ref switches to P(RI) mode when
percpu_ref_switch_to_unmanaged() is called for it, and P(RI) mode can
then be switched directly to A mode using percpu_ref_switch_to_atomic().
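As an illustration only (not taken from the patch), a minimal sketch of
the early-teardown path for a hypothetical object with a managed ref
could look like the following; my_obj, my_obj_alloc() and
my_obj_shutdown() are made-up names, and the exact reference
bookkeeping follows the usual percpu_ref rules once the ref has left
managed mode:

#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct my_obj {
	struct percpu_ref ref;
	struct rcu_head rcu;
};

static void my_obj_release(struct percpu_ref *ref)
{
	struct my_obj *obj = container_of(ref, struct my_obj, ref);

	kfree_rcu(obj, rcu);
}

static struct my_obj *my_obj_alloc(gfp_t gfp)
{
	struct my_obj *obj = kzalloc(sizeof(*obj), gfp);

	if (!obj)
		return NULL;
	/* Managed from init: normally no explicit percpu_ref_kill() is needed. */
	if (percpu_ref_init(&obj->ref, my_obj_release,
			    PERCPU_REF_REL_MANAGED, gfp)) {
		kfree(obj);
		return NULL;
	}
	return obj;
}

/* Early shutdown path; the caller holds a reference on @obj. */
static void my_obj_shutdown(struct my_obj *obj)
{
	/* Leave managed mode; the ref becomes a plain reinitable percpu ref. */
	percpu_ref_switch_to_unmanaged(&obj->ref);
	/* Tear down explicitly instead of waiting for the release worker. */
	percpu_ref_kill(&obj->ref);
}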
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
include/linux/percpu-refcount.h | 3 +-
lib/percpu-refcount.c | 248 +++++++++++---------------------
2 files changed, 88 insertions(+), 163 deletions(-)
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index e6aea81b3d01..fe967db431a6 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -110,7 +110,7 @@ struct percpu_ref_data {
struct rcu_head rcu;
struct percpu_ref *ref;
unsigned int aux_flags;
- struct llist_node node;
+ struct list_head node;
};
@@ -139,6 +139,7 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
void percpu_ref_switch_to_atomic_sync(struct percpu_ref *ref);
void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
int percpu_ref_switch_to_managed(struct percpu_ref *ref);
+void percpu_ref_switch_to_unmanaged(struct percpu_ref *ref);
void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
percpu_ref_func_t *confirm_kill);
void percpu_ref_resurrect(struct percpu_ref *ref);
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 7d0c85c7ce57..b79e36905aa4 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -5,7 +5,7 @@
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/slab.h>
-#include <linux/llist.h>
+#include <linux/list.h>
#include <linux/moduleparam.h>
#include <linux/types.h>
#include <linux/mm.h>
@@ -43,7 +43,12 @@
static DEFINE_SPINLOCK(percpu_ref_switch_lock);
static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
-static LLIST_HEAD(percpu_ref_manage_head);
+static struct list_head percpu_ref_manage_head = LIST_HEAD_INIT(percpu_ref_manage_head);
+/* Spinlock protects node additions/deletions */
+static DEFINE_SPINLOCK(percpu_ref_manage_lock);
+/* Mutex synchronizes node deletions with the node being scanned */
+static DEFINE_MUTEX(percpu_ref_active_switch_mutex);
+static struct list_head *next_percpu_ref_node = &percpu_ref_manage_head;
static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
{
@@ -112,7 +117,7 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
data->confirm_switch = NULL;
data->ref = ref;
ref->data = data;
- init_llist_node(&data->node);
+ INIT_LIST_HEAD(&data->node);
if (flags & PERCPU_REF_REL_MANAGED)
percpu_ref_switch_to_managed(ref);
@@ -150,9 +155,9 @@ static int __percpu_ref_switch_to_managed(struct percpu_ref *ref)
data->force_atomic = false;
if (!__ref_is_percpu(ref, &percpu_count))
__percpu_ref_switch_mode(ref, NULL);
- /* Ensure ordering of percpu mode switch and node scan */
- smp_mb();
- llist_add(&data->node, &percpu_ref_manage_head);
+ spin_lock(&percpu_ref_manage_lock);
+ list_add(&data->node, &percpu_ref_manage_head);
+ spin_unlock(&percpu_ref_manage_lock);
return 0;
@@ -162,7 +167,7 @@ static int __percpu_ref_switch_to_managed(struct percpu_ref *ref)
}
/**
- * percpu_ref_switch_to_managed - Switch an unmanaged ref to percpu mode.
+ * percpu_ref_switch_to_managed - Switch an unmanaged ref to percpu managed mode.
*
* @ref: percpu_ref to switch to managed mode
*
@@ -179,6 +184,47 @@ int percpu_ref_switch_to_managed(struct percpu_ref *ref)
}
EXPORT_SYMBOL_GPL(percpu_ref_switch_to_managed);
+/**
+ * percpu_ref_switch_to_unmanaged - Switch a managed ref to percpu mode.
+ *
+ * @ref: percpu_ref to switch back to unmanaged percpu mode
+ *
+ * Must only be called with elevated refcount.
+ */
+void percpu_ref_switch_to_unmanaged(struct percpu_ref *ref)
+{
+ bool mutex_taken = false;
+ struct list_head *node;
+ unsigned long flags;
+
+ might_sleep();
+
+ WARN_ONCE(!percpu_ref_is_managed(ref), "Percpu ref is not managed");
+
+ node = &ref->data->node;
+ spin_lock(&percpu_ref_manage_lock);
+ if (list_empty(node)) {
+ spin_unlock(&percpu_ref_manage_lock);
+ mutex_taken = true;
+ mutex_lock(&percpu_ref_active_switch_mutex);
+ spin_lock(&percpu_ref_manage_lock);
+ }
+
+ if (next_percpu_ref_node == node)
+ next_percpu_ref_node = next_percpu_ref_node->next;
+ list_del_init(node);
+ spin_unlock(&percpu_ref_manage_lock);
+ if (mutex_taken)
+ mutex_unlock(&percpu_ref_active_switch_mutex);
+
+ /* Drop the pseudo-init reference */
+ percpu_ref_put(ref);
+ spin_lock_irqsave(&percpu_ref_switch_lock, flags);
+ ref->data->aux_flags &= ~__PERCPU_REL_MANAGED;
+ spin_unlock_irqrestore(&percpu_ref_switch_lock, flags);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_switch_to_unmanaged);
+
static void __percpu_ref_exit(struct percpu_ref *ref)
{
unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
@@ -599,164 +645,35 @@ module_param(max_scan_count, int, 0444);
static void percpu_ref_release_work_fn(struct work_struct *work);
-/*
- * Sentinel llist nodes for lockless list traveral and deletions by
- * the pcpu ref release worker, while nodes are added from
- * percpu_ref_init() and percpu_ref_switch_to_managed().
- *
- * Sentinel node marks the head of list traversal for the current
- * iteration of kworker execution.
- */
-struct percpu_ref_sen_node {
- bool inuse;
- struct llist_node node;
-};
-
-/*
- * We need two sentinel nodes for lockless list manipulations from release
- * worker - first node will be used in current reclaim iteration. The second
- * node will be used in next iteration. Next iteration marks the first node
- * as free, for use in subsequent iteration.
- */
-#define PERCPU_REF_SEN_NODES_COUNT 2
-
-/* Track last processed percpu ref node */
-static struct llist_node *last_percpu_ref_node;
-
-static struct percpu_ref_sen_node
- percpu_ref_sen_nodes[PERCPU_REF_SEN_NODES_COUNT];
-
static DECLARE_DELAYED_WORK(percpu_ref_release_work, percpu_ref_release_work_fn);
-static bool percpu_ref_is_sen_node(struct llist_node *node)
-{
- return &percpu_ref_sen_nodes[0].node <= node &&
- node <= &percpu_ref_sen_nodes[PERCPU_REF_SEN_NODES_COUNT - 1].node;
-}
-
-static struct llist_node *percpu_ref_get_sen_node(void)
-{
- int i;
- struct percpu_ref_sen_node *sn;
-
- for (i = 0; i < PERCPU_REF_SEN_NODES_COUNT; i++) {
- sn = &percpu_ref_sen_nodes[i];
- if (!sn->inuse) {
- sn->inuse = true;
- return &sn->node;
- }
- }
-
- return NULL;
-}
-
-static void percpu_ref_put_sen_node(struct llist_node *node)
-{
- struct percpu_ref_sen_node *sn = container_of(node, struct percpu_ref_sen_node, node);
-
- sn->inuse = false;
- init_llist_node(node);
-}
-
-static void percpu_ref_put_all_sen_nodes_except(struct llist_node *node)
-{
- int i;
-
- for (i = 0; i < PERCPU_REF_SEN_NODES_COUNT; i++) {
- if (&percpu_ref_sen_nodes[i].node == node)
- continue;
- percpu_ref_sen_nodes[i].inuse = false;
- init_llist_node(&percpu_ref_sen_nodes[i].node);
- }
-}
-
static struct workqueue_struct *percpu_ref_release_wq;
static void percpu_ref_release_work_fn(struct work_struct *work)
{
- struct llist_node *pos, *first, *head, *prev, *next;
- struct llist_node *sen_node;
+ struct list_head *node;
struct percpu_ref *ref;
int count = 0;
bool held;
- struct llist_node *last_node = READ_ONCE(last_percpu_ref_node);
- first = READ_ONCE(percpu_ref_manage_head.first);
- if (!first)
+ mutex_lock(&percpu_ref_active_switch_mutex);
+ spin_lock(&percpu_ref_manage_lock);
+ if (list_empty(&percpu_ref_manage_head)) {
+ next_percpu_ref_node = &percpu_ref_manage_head;
+ spin_unlock(&percpu_ref_manage_lock);
+ mutex_unlock(&percpu_ref_active_switch_mutex);
goto queue_release_work;
-
- /*
- * Enqueue a dummy node to mark the start of scan. This dummy
- * node is used as start point of scan and ensures that
- * there is no additional synchronization required with new
- * label node additions to the llist. Any new labels will
- * be processed in next run of the kworker.
- *
- * SCAN START PTR
- * |
- * v
- * +----------+ +------+ +------+ +------+
- * | | | | | | | |
- * | head ------> dummy|--->|label |--->| label|--->NULL
- * | | | node | | | | |
- * +----------+ +------+ +------+ +------+
- *
- *
- * New label addition:
- *
- * SCAN START PTR
- * |
- * v
- * +----------+ +------+ +------+ +------+ +------+
- * | | | | | | | | | |
- * | head |--> label|--> dummy|--->|label |--->| label|--->NULL
- * | | | | | node | | | | |
- * +----------+ +------+ +------+ +------+ +------+
- *
- */
- if (last_node == NULL || last_node->next == NULL) {
-retry_sentinel_get:
- sen_node = percpu_ref_get_sen_node();
- /*
- * All sentinel nodes are in use? This should not happen, as we
- * require only one sentinel for the start of list traversal and
- * other sentinel node is freed during the traversal.
- */
- if (WARN_ONCE(!sen_node, "All sentinel nodes are in use")) {
- /* Use first node as the sentinel node */
- head = first->next;
- if (!head) {
- struct llist_node *ign_node = NULL;
- /*
- * We exhausted sentinel nodes. However, there aren't
- * enough nodes in the llist. So, we have leaked
- * sentinel nodes. Reclaim sentinels and retry.
- */
- if (percpu_ref_is_sen_node(first))
- ign_node = first;
- percpu_ref_put_all_sen_nodes_except(ign_node);
- goto retry_sentinel_get;
- }
- prev = first;
- } else {
- llist_add(sen_node, &percpu_ref_manage_head);
- prev = sen_node;
- head = prev->next;
- }
- } else {
- prev = last_node;
- head = prev->next;
}
+ if (next_percpu_ref_node == &percpu_ref_manage_head)
+ node = percpu_ref_manage_head.next;
+ else
+ node = next_percpu_ref_node;
+ next_percpu_ref_node = node->next;
+ list_del_init(node);
+ spin_unlock(&percpu_ref_manage_lock);
- llist_for_each_safe(pos, next, head) {
- /* Free sentinel node which is present in the list */
- if (percpu_ref_is_sen_node(pos)) {
- prev->next = pos->next;
- percpu_ref_put_sen_node(pos);
- continue;
- }
-
- ref = container_of(pos, struct percpu_ref_data, node)->ref;
+ while (!list_is_head(node, &percpu_ref_manage_head)) {
+ ref = container_of(node, struct percpu_ref_data, node)->ref;
__percpu_ref_switch_to_atomic_sync_checked(ref, false);
/*
* Drop the ref while in RCU read critical section to
@@ -765,24 +682,31 @@ static void percpu_ref_release_work_fn(struct work_struct *work)
rcu_read_lock();
percpu_ref_put(ref);
held = percpu_ref_tryget(ref);
- if (!held) {
- prev->next = pos->next;
- init_llist_node(pos);
+ if (held) {
+ spin_lock(&percpu_ref_manage_lock);
+ list_add(node, &percpu_ref_manage_head);
+ spin_unlock(&percpu_ref_manage_lock);
+ __percpu_ref_switch_to_percpu_checked(ref, false);
+ } else {
ref->percpu_count_ptr |= __PERCPU_REF_DEAD;
}
rcu_read_unlock();
- if (!held)
- continue;
- __percpu_ref_switch_to_percpu_checked(ref, false);
+ mutex_unlock(&percpu_ref_active_switch_mutex);
count++;
- if (count == READ_ONCE(max_scan_count)) {
- WRITE_ONCE(last_percpu_ref_node, pos);
+ if (count == READ_ONCE(max_scan_count))
goto queue_release_work;
+ mutex_lock(&percpu_ref_active_switch_mutex);
+ spin_lock(&percpu_ref_manage_lock);
+ node = next_percpu_ref_node;
+ if (!list_is_head(next_percpu_ref_node, &percpu_ref_manage_head)) {
+ next_percpu_ref_node = next_percpu_ref_node->next;
+ list_del_init(node);
}
- prev = pos;
+ spin_unlock(&percpu_ref_manage_lock);
}
- WRITE_ONCE(last_percpu_ref_node, NULL);
+ mutex_unlock(&percpu_ref_active_switch_mutex);
+
queue_release_work:
queue_delayed_work(percpu_ref_release_wq, &percpu_ref_release_work,
scan_interval);
--
2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* [RFC 4/6] percpu-refcount-torture: Extend test with runtime mode switches
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
` (2 preceding siblings ...)
2024-09-16 5:08 ` [RFC 3/6] percpu-refcount: Extend managed mode to allow runtime switching Neeraj Upadhyay
@ 2024-09-16 5:08 ` Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 5/6] apparmor: Switch labels to percpu refcount in atomic mode Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 6/6] apparmor: Switch labels to percpu ref managed mode Neeraj Upadhyay
5 siblings, 0 replies; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
Extend the test to exercise runtime switching from managed
mode to other reinitable active modes.
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
lib/percpu-refcount-torture.c | 41 +++++++++++++++++++++++++++++++++--
lib/percpu-refcount.c | 12 +++++++++-
2 files changed, 50 insertions(+), 3 deletions(-)
diff --git a/lib/percpu-refcount-torture.c b/lib/percpu-refcount-torture.c
index 686f5a228b40..cb2700b16517 100644
--- a/lib/percpu-refcount-torture.c
+++ b/lib/percpu-refcount-torture.c
@@ -3,6 +3,7 @@
#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
+#include <linux/mutex.h>
#include <linux/percpu-refcount.h>
#include <linux/torture.h>
@@ -59,6 +60,7 @@ static struct task_struct **busted_late_release_tasks;
static struct percpu_ref *refs;
static long *num_per_ref_users;
+static struct mutex *ref_switch_mutexes;
static atomic_t running;
static atomic_t *ref_running;
@@ -97,19 +99,36 @@ static int percpu_ref_manager_thread(void *data)
static int percpu_ref_test_thread(void *data)
{
struct percpu_ref *ref = (struct percpu_ref *)data;
+ DEFINE_TORTURE_RANDOM(rand);
+ int ref_idx = ref - refs;
+ int do_switch;
int i = 0;
percpu_ref_get(ref);
do {
percpu_ref_get(ref);
+ /* Perform checks once per 256 iterations */
+ do_switch = (torture_random(&rand) & 0xff);
udelay(delay_us);
+ if (do_switch) {
+ mutex_lock(&ref_switch_mutexes[ref_idx]);
+ percpu_ref_switch_to_unmanaged(ref);
+ udelay(delay_us);
+ percpu_ref_switch_to_atomic_sync(ref);
+ if (do_switch & 1)
+ percpu_ref_switch_to_percpu(ref);
+ udelay(delay_us);
+ percpu_ref_switch_to_managed(ref);
+ mutex_unlock(&ref_switch_mutexes[ref_idx]);
+ udelay(delay_us);
+ }
percpu_ref_put(ref);
stutter_wait("percpu_ref_test_thread");
i++;
} while (i < niterations);
- atomic_dec(&ref_running[ref - refs]);
+ atomic_dec(&ref_running[ref_idx]);
/* Order ref release with ref_running[ref_idx] == 0 */
smp_mb();
percpu_ref_put(ref);
@@ -213,6 +232,13 @@ static void percpu_ref_test_cleanup(void)
kfree(num_per_ref_users);
num_per_ref_users = NULL;
+ if (ref_switch_mutexes) {
+ for (i = 0; i < nrefs; i++)
+ mutex_destroy(&ref_switch_mutexes[i]);
+ kfree(ref_switch_mutexes);
+ ref_switch_mutexes = NULL;
+ }
+
if (refs) {
for (i = 0; i < nrefs; i++)
percpu_ref_exit(&refs[i]);
@@ -251,7 +277,8 @@ static int __init percpu_ref_torture_init(void)
goto init_err;
}
for (i = 0; i < nrefs; i++) {
- flags = torture_random(trsp) & 1 ? PERCPU_REF_INIT_ATOMIC : PERCPU_REF_REL_MANAGED;
+ flags = (torture_random(trsp) & 1) ? PERCPU_REF_INIT_ATOMIC :
+ PERCPU_REF_REL_MANAGED;
err = percpu_ref_init(&refs[i], percpu_ref_test_release,
flags, GFP_KERNEL);
if (err)
@@ -269,6 +296,16 @@ static int __init percpu_ref_torture_init(void)
for (i = 0; i < nrefs; i++)
num_per_ref_users[i] = 0;
+ ref_switch_mutexes = kcalloc(nrefs, sizeof(ref_switch_mutexes[0]), GFP_KERNEL);
+ if (!ref_switch_mutexes) {
+ TOROUT_ERRSTRING("out of memory");
+ err = -ENOMEM;
+ goto init_err;
+ }
+
+ for (i = 0; i < nrefs; i++)
+ mutex_init(&ref_switch_mutexes[i]);
+
ref_user_tasks = kcalloc(nusers, sizeof(ref_user_tasks[0]), GFP_KERNEL);
if (!ref_user_tasks) {
TOROUT_ERRSTRING("out of memory");
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b79e36905aa4..4e0a453bd51f 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -723,6 +723,7 @@ EXPORT_SYMBOL_GPL(percpu_ref_test_is_percpu);
void percpu_ref_test_flush_release_work(void)
{
int max_flush = READ_ONCE(max_scan_count);
+ struct list_head *next;
int max_count = 1000;
/* Complete any executing release work */
@@ -738,8 +739,17 @@ void percpu_ref_test_flush_release_work(void)
/* max scan count update visible to work */
smp_mb();
flush_delayed_work(&percpu_ref_release_work);
- while (READ_ONCE(last_percpu_ref_node) != NULL && max_count--)
+
+ while (true) {
+ if (!max_count--)
+ break;
+ spin_lock(&percpu_ref_manage_lock);
+ next = next_percpu_ref_node;
+ spin_unlock(&percpu_ref_manage_lock);
+ if (list_is_head(next, &percpu_ref_manage_head))
+ break;
flush_delayed_work(&percpu_ref_release_work);
+ }
/* max scan count update visible to work */
smp_mb();
WRITE_ONCE(max_scan_count, max_flush);
--
2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* [RFC 5/6] apparmor: Switch labels to percpu refcount in atomic mode
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
` (3 preceding siblings ...)
2024-09-16 5:08 ` [RFC 4/6] percpu-refcount-torture: Extend test with runtime mode switches Neeraj Upadhyay
@ 2024-09-16 5:08 ` Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 6/6] apparmor: Switch labels to percpu ref managed mode Neeraj Upadhyay
5 siblings, 0 replies; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
In preparation for using percpu refcount for labels, replace the label
kref with a percpu ref. The percpu ref is initialized in atomic mode,
as using percpu mode requires tracking ref kill points. As the atomic
counter now lives in a different cacheline, rearrange some of the
fields in struct aa_label (flags, proxy) to optimize some of the fast
paths for unconfined labels.
In addition to the requirement to clean up the percpu ref using
percpu_ref_exit() in the label destruction path, other potential
impacts of this patch are:
- Increased memory requirement (for the per-CPU counters) for each label.
- Displacement of aa_label struct members to a different cacheline, as
  a percpu ref takes the space of two pointers.
- Moving of the atomic counter outside of the cacheline of the aa_label
  struct.
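For reference, below is a generic sketch (illustrative only, not
AppArmor-specific) of what such a kref to percpu ref (atomic mode)
conversion looks like for an RCU-freed object; struct foo and the
foo_*() helpers are made-up names:

#include <linux/percpu-refcount.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	struct percpu_ref count;	/* was: struct kref count; */
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *head)
{
	struct foo *f = container_of(head, struct foo, rcu);

	percpu_ref_exit(&f->count);	/* new requirement vs. kref */
	kfree(f);
}

/* Registered at init time; called when the count drops to zero. */
static void foo_release(struct percpu_ref *ref)
{
	struct foo *f = container_of(ref, struct foo, count);

	call_rcu(&f->rcu, foo_free_rcu);
}

static int foo_init(struct foo *f, gfp_t gfp)
{
	/* was: kref_init(&f->count); */
	return percpu_ref_init(&f->count, foo_release,
			       PERCPU_REF_INIT_ATOMIC, gfp);
}

static struct foo *foo_get(struct foo *f)
{
	if (f)
		percpu_ref_get(&f->count);	/* was: kref_get() */
	return f;
}

static struct foo *foo_get_not0(struct foo *f)
{
	/* was: kref_get_unless_zero() */
	if (f && percpu_ref_tryget(&f->count))
		return f;
	return NULL;
}

static void foo_put(struct foo *f)
{
	if (f)
		percpu_ref_put(&f->count);	/* was: kref_put(..., release) */
}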
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
security/apparmor/include/label.h | 16 ++++++++--------
security/apparmor/include/policy.h | 8 ++++----
security/apparmor/label.c | 11 ++++++++---
3 files changed, 20 insertions(+), 15 deletions(-)
diff --git a/security/apparmor/include/label.h b/security/apparmor/include/label.h
index 2a72e6b17d68..4b29a4679c74 100644
--- a/security/apparmor/include/label.h
+++ b/security/apparmor/include/label.h
@@ -121,12 +121,12 @@ struct label_it {
* @ent: set of profiles for label, actual size determined by @size
*/
struct aa_label {
- struct kref count;
+ struct percpu_ref count;
+ long flags;
+ struct aa_proxy *proxy;
struct rb_node node;
struct rcu_head rcu;
- struct aa_proxy *proxy;
__counted char *hname;
- long flags;
u32 secid;
int size;
struct aa_profile *vec[];
@@ -276,7 +276,7 @@ void __aa_labelset_update_subtree(struct aa_ns *ns);
void aa_label_destroy(struct aa_label *label);
void aa_label_free(struct aa_label *label);
-void aa_label_kref(struct kref *kref);
+void aa_label_percpu_ref(struct percpu_ref *ref);
bool aa_label_init(struct aa_label *label, int size, gfp_t gfp);
struct aa_label *aa_label_alloc(int size, struct aa_proxy *proxy, gfp_t gfp);
@@ -373,7 +373,7 @@ int aa_label_match(struct aa_profile *profile, struct aa_ruleset *rules,
*/
static inline struct aa_label *__aa_get_label(struct aa_label *l)
{
- if (l && kref_get_unless_zero(&l->count))
+ if (l && percpu_ref_tryget(&l->count))
return l;
return NULL;
@@ -382,7 +382,7 @@ static inline struct aa_label *__aa_get_label(struct aa_label *l)
static inline struct aa_label *aa_get_label(struct aa_label *l)
{
if (l)
- kref_get(&(l->count));
+ percpu_ref_get(&(l->count));
return l;
}
@@ -402,7 +402,7 @@ static inline struct aa_label *aa_get_label_rcu(struct aa_label __rcu **l)
rcu_read_lock();
do {
c = rcu_dereference(*l);
- } while (c && !kref_get_unless_zero(&c->count));
+ } while (c && !percpu_ref_tryget(&c->count));
rcu_read_unlock();
return c;
@@ -442,7 +442,7 @@ static inline struct aa_label *aa_get_newest_label(struct aa_label *l)
static inline void aa_put_label(struct aa_label *l)
{
if (l)
- kref_put(&l->count, aa_label_kref);
+ percpu_ref_put(&l->count);
}
diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index 75088cc310b6..5849b6b94cea 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -329,7 +329,7 @@ static inline aa_state_t ANY_RULE_MEDIATES(struct list_head *head,
static inline struct aa_profile *aa_get_profile(struct aa_profile *p)
{
if (p)
- kref_get(&(p->label.count));
+ percpu_ref_get(&(p->label.count));
return p;
}
@@ -343,7 +343,7 @@ static inline struct aa_profile *aa_get_profile(struct aa_profile *p)
*/
static inline struct aa_profile *aa_get_profile_not0(struct aa_profile *p)
{
- if (p && kref_get_unless_zero(&p->label.count))
+ if (p && percpu_ref_tryget(&p->label.count))
return p;
return NULL;
@@ -363,7 +363,7 @@ static inline struct aa_profile *aa_get_profile_rcu(struct aa_profile __rcu **p)
rcu_read_lock();
do {
c = rcu_dereference(*p);
- } while (c && !kref_get_unless_zero(&c->label.count));
+ } while (c && !percpu_ref_tryget(&c->label.count));
rcu_read_unlock();
return c;
@@ -376,7 +376,7 @@ static inline struct aa_profile *aa_get_profile_rcu(struct aa_profile __rcu **p)
static inline void aa_put_profile(struct aa_profile *p)
{
if (p)
- kref_put(&p->label.count, aa_label_kref);
+ percpu_ref_put(&p->label.count);
}
static inline int AUDIT_MODE(struct aa_profile *profile)
diff --git a/security/apparmor/label.c b/security/apparmor/label.c
index c71e4615dd46..aa9e6eac3ecc 100644
--- a/security/apparmor/label.c
+++ b/security/apparmor/label.c
@@ -336,6 +336,7 @@ void aa_label_destroy(struct aa_label *label)
rcu_assign_pointer(label->proxy->label, NULL);
aa_put_proxy(label->proxy);
}
+ percpu_ref_exit(&label->count);
aa_free_secid(label->secid);
label->proxy = (struct aa_proxy *) PROXY_POISON + 1;
@@ -369,9 +370,9 @@ static void label_free_rcu(struct rcu_head *head)
label_free_switch(label);
}
-void aa_label_kref(struct kref *kref)
+void aa_label_percpu_ref(struct percpu_ref *ref)
{
- struct aa_label *label = container_of(kref, struct aa_label, count);
+ struct aa_label *label = container_of(ref, struct aa_label, count);
struct aa_ns *ns = labels_ns(label);
if (!ns) {
@@ -408,7 +409,11 @@ bool aa_label_init(struct aa_label *label, int size, gfp_t gfp)
label->size = size; /* doesn't include null */
label->vec[size] = NULL; /* null terminate */
- kref_init(&label->count);
+ if (percpu_ref_init(&label->count, aa_label_percpu_ref, PERCPU_REF_INIT_ATOMIC, gfp)) {
+ aa_free_secid(label->secid);
+ return false;
+ }
+
RB_CLEAR_NODE(&label->node);
return true;
--
2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* [RFC 6/6] apparmor: Switch labels to percpu ref managed mode
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
` (4 preceding siblings ...)
2024-09-16 5:08 ` [RFC 5/6] apparmor: Switch labels to percpu refcount in atomic mode Neeraj Upadhyay
@ 2024-09-16 5:08 ` Neeraj Upadhyay
2024-09-18 5:44 ` kernel test robot
5 siblings, 1 reply; 8+ messages in thread
From: Neeraj Upadhyay @ 2024-09-16 5:08 UTC (permalink / raw)
To: linux-kernel
Cc: john.johansen, paul, jmorris, serge, linux-security-module,
gautham.shenoy, Santosh.Shukla, Ananth.Narayan,
Raghavendra.KodsaraThimmappa, paulmck, boqun.feng,
vinicius.gomes, mjguzik, dennis, tj, cl, linux-mm, rcu
Nginx performance testing with AppArmor enabled (with Nginx running in
the unconfined profile), on kernel versions 6.1 and 6.5, shows a
significant drop in throughput scalability when Nginx workers are
scaled to use a higher number of CPUs across various L3 cache domains.
Below is one sample of the throughput scalability loss, based on
results on an AMD Zen4 system with 96 cores and 2 SMT threads per core:
Config     Cache Domains   apparmor=off      apparmor=on
                           scaling eff (%)   scaling eff (%)
8C16T            1             100%              100%
16C32T           2              95%               94%
24C48T           3              94%               93%
48C96T           6              92%               88%
96C192T         12              85%               68%
There is a significant drop in scaling efficiency for 96 cores/192 SMT
threads.
The perf tool shows most of the contention coming from the following places:
6.56% nginx [kernel.vmlinux] [k] apparmor_current_getsecid_subj
6.22% nginx [kernel.vmlinux] [k] apparmor_file_open
The majority of the CPU cycles are found to be due to memory contention
in the atomic_fetch_add and atomic_fetch_sub operations issued by
kref_get() and kref_put() on AppArmor labels.
A part of the contention was fixed by commit 2516fde1fa00 ("apparmor:
Optimize retrieving current task secid"). With this commit included,
the scaling efficiency improved as shown below:
Config     Cache Domains   apparmor=on       apparmor=on (patched)
                           scaling eff (%)   scaling eff (%)
8C16T            1             100%              100%
16C32T           2              97%               93%
24C48T           3              94%               92%
48C96T           6              88%               88%
96C192T         12              65%               79%
However, the scaling efficiency impact is still significant even with
this commit included. The performance impact is even higher for >192
CPUs. In addition, the memory contention impact would increase when
label update operations are frequent and labels are marked stale more
often.
Use the new percpu managed mode for tracking the release of all
AppArmor labels. Using a percpu refcount for AppArmor labels'
refcounting improves throughput scalability for Nginx:
Config     Cache Domains   apparmor=on (percpuref)
                           scaling eff (%)
8C16T            1             100%
16C32T           2              96%
24C48T           3              94%
48C96T           6              93%
96C192T         12              90%
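As a sketch of the pattern used here (illustrative only; struct foo,
foo_list and foo_publish() are made-up names), an object whose ref was
initialized in atomic mode is switched to managed mode once it is
published and starts seeing frequent get/put traffic, mirroring the
__label_insert()/alloc_ns() hunks below:

#include <linux/list.h>
#include <linux/percpu-refcount.h>
#include <linux/spinlock.h>

struct foo {
	struct percpu_ref count;	/* initialized with PERCPU_REF_INIT_ATOMIC */
	struct list_head node;
};

static LIST_HEAD(foo_list);
static DEFINE_SPINLOCK(foo_list_lock);

static void foo_publish(struct foo *f)
{
	spin_lock(&foo_list_lock);
	list_add(&f->node, &foo_list);
	spin_unlock(&foo_list_lock);

	/*
	 * From here on, lookups take short-lived references at a high
	 * rate; let the managed-mode release worker detect the final
	 * put instead of requiring an explicit percpu_ref_kill().
	 * The return value is ignored here, as it is in the patch.
	 */
	percpu_ref_switch_to_managed(&f->count);
}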
Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com>
---
The apparmor_file_open() refcount contention has recently been resolved
by commit f4fee216df7d ("apparmor: try to avoid refing the label in
apparmor_file_open"). I have posted this series to get feedback on the
approach to improving refcount scalability within the AppArmor subsystem.
security/apparmor/label.c | 1 +
security/apparmor/policy_ns.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/security/apparmor/label.c b/security/apparmor/label.c
index aa9e6eac3ecc..016a45a180b1 100644
--- a/security/apparmor/label.c
+++ b/security/apparmor/label.c
@@ -710,6 +710,7 @@ static struct aa_label *__label_insert(struct aa_labelset *ls,
rb_link_node(&label->node, parent, new);
rb_insert_color(&label->node, &ls->root);
label->flags |= FLAG_IN_TREE;
+ percpu_ref_switch_to_managed(&label->count);
return aa_get_label(label);
}
diff --git a/security/apparmor/policy_ns.c b/security/apparmor/policy_ns.c
index 1f02cfe1d974..18eb58b68a60 100644
--- a/security/apparmor/policy_ns.c
+++ b/security/apparmor/policy_ns.c
@@ -124,6 +124,7 @@ static struct aa_ns *alloc_ns(const char *prefix, const char *name)
goto fail_unconfined;
/* ns and ns->unconfined share ns->unconfined refcount */
ns->unconfined->ns = ns;
+ percpu_ref_switch_to_managed(&ns->unconfined->label.count);
atomic_set(&ns->uniq_null, 0);
@@ -377,6 +378,7 @@ int __init aa_alloc_root_ns(void)
}
kernel_t = &kernel_p->label;
root_ns->unconfined->ns = aa_get_ns(root_ns);
+ percpu_ref_switch_to_managed(&root_ns->unconfined->label.count);
return 0;
}
--
2.34.1
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC 6/6] apparmor: Switch labels to percpu ref managed mode
2024-09-16 5:08 ` [RFC 6/6] apparmor: Switch labels to percpu ref managed mode Neeraj Upadhyay
@ 2024-09-18 5:44 ` kernel test robot
0 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2024-09-18 5:44 UTC (permalink / raw)
To: Neeraj Upadhyay
Cc: oe-lkp, lkp, apparmor, linux-kernel, john.johansen, paul,
jmorris, serge, linux-security-module, gautham.shenoy,
Santosh.Shukla, Ananth.Narayan, Raghavendra.KodsaraThimmappa,
paulmck, boqun.feng, vinicius.gomes, mjguzik, dennis, tj, cl,
linux-mm, rcu, oliver.sang
Hello,
kernel test robot noticed "WARNING:at_lib/percpu-refcount.c:#__percpu_ref_switch_to_managed" on:
commit: 59b177cbdc4908a728329b8eec742969a2285979 ("[RFC 6/6] apparmor: Switch labels to percpu ref managed mode")
url: https://github.com/intel-lab-lkp/linux/commits/Neeraj-Upadhyay/percpu-refcount-Add-managed-mode-for-RCU-released-objects/20240916-131210
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-nonmm-unstable
patch link: https://lore.kernel.org/all/20240916050811.473556-7-Neeraj.Upadhyay@amd.com/
patch subject: [RFC 6/6] apparmor: Switch labels to percpu ref managed mode
in testcase: boot
compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+------------------------------------------------------------------+------------+------------+
|                                                                  | 124f137c55 | 59b177cbdc |
+------------------------------------------------------------------+------------+------------+
| WARNING:at_lib/percpu-refcount.c:#__percpu_ref_switch_to_managed |          0 |         13 |
| RIP:__percpu_ref_switch_to_managed                               |          0 |         13 |
+------------------------------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202409181358.c07681be-oliver.sang@intel.com
[ 13.041399][ T0] ------------[ cut here ]------------
[ 13.042302][ T0] Percpu ref is already managed
[ 13.042302][ T0] WARNING: CPU: 0 PID: 0 at lib/percpu-refcount.c:151 __percpu_ref_switch_to_managed (lib/percpu-refcount.c:151 (discriminator 3))
[ 13.042302][ T0] Modules linked in:
[ 13.042302][ T0] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.11.0-rc6-00143-g59b177cbdc49 #12
[ 13.042302][ T0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 13.042302][ T0] RIP: 0010:__percpu_ref_switch_to_managed (lib/percpu-refcount.c:151 (discriminator 3))
[ 13.042302][ T0] Code: 03 75 58 65 48 ff 08 e8 c1 7c ee fe eb b1 80 3d 5d a2 ca 03 00 75 c5 48 c7 c7 c0 61 8f 91 c6 05 4d a2 ca 03 01 e8 03 af ce fe <0f> 0b eb ae 31 f6 48 89 df e8 05 f7 ff ff e9 54 fe ff ff e8 4b aa
All code
========
0: 03 75 58 add 0x58(%rbp),%esi
3: 65 48 ff 08 decq %gs:(%rax)
7: e8 c1 7c ee fe callq 0xfffffffffeee7ccd
c: eb b1 jmp 0xffffffffffffffbf
e: 80 3d 5d a2 ca 03 00 cmpb $0x0,0x3caa25d(%rip) # 0x3caa272
15: 75 c5 jne 0xffffffffffffffdc
17: 48 c7 c7 c0 61 8f 91 mov $0xffffffff918f61c0,%rdi
1e: c6 05 4d a2 ca 03 01 movb $0x1,0x3caa24d(%rip) # 0x3caa272
25: e8 03 af ce fe callq 0xfffffffffeceaf2d
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb ae jmp 0xffffffffffffffdc
2e: 31 f6 xor %esi,%esi
30: 48 89 df mov %rbx,%rdi
33: e8 05 f7 ff ff callq 0xfffffffffffff73d
38: e9 54 fe ff ff jmpq 0xfffffffffffffe91
3d: e8 .byte 0xe8
3e: 4b aa rex.WXB stos %al,%es:(%rdi)
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb ae jmp 0xffffffffffffffb2
4: 31 f6 xor %esi,%esi
6: 48 89 df mov %rbx,%rdi
9: e8 05 f7 ff ff callq 0xfffffffffffff713
e: e9 54 fe ff ff jmpq 0xfffffffffffffe67
13: e8 .byte 0xe8
14: 4b aa rex.WXB stos %al,%es:(%rdi)
[ 13.042302][ T0] RSP: 0000:ffffffff92407e00 EFLAGS: 00010082
[ 13.042302][ T0] RAX: 0000000000000000 RBX: ffff8881008a6500 RCX: 1ffffffff24a3780
[ 13.042302][ T0] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
[ 13.042302][ T0] RBP: ffff8881008a6508 R08: 0000000000000000 R09: fffffbfff24a3780
[ 13.042302][ T0] R10: ffffffff9251bc03 R11: 0000000000000001 R12: ffff8881001f3500
[ 13.042302][ T0] R13: ffff8881001f3500 R14: 0000000000000005 R15: 00000000000147b0
[ 13.042302][ T0] FS: 0000000000000000(0000) GS:ffff8883a8400000(0000) knlGS:0000000000000000
[ 13.042302][ T0] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.042302][ T0] CR2: ffff88843ffff000 CR3: 00000003b6e62000 CR4: 00000000000006f0
[ 13.042302][ T0] Call Trace:
[ 13.042302][ T0] <TASK>
[ 13.042302][ T0] ? __warn (kernel/panic.c:741)
[ 13.042302][ T0] ? __percpu_ref_switch_to_managed (lib/percpu-refcount.c:151 (discriminator 3))
[ 13.042302][ T0] ? report_bug (lib/bug.c:180 lib/bug.c:219)
[ 13.042302][ T0] ? handle_bug (arch/x86/kernel/traps.c:239)
[ 13.042302][ T0] ? exc_invalid_op (arch/x86/kernel/traps.c:260 (discriminator 1))
[ 13.042302][ T0] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[ 13.042302][ T0] ? __percpu_ref_switch_to_managed (lib/percpu-refcount.c:151 (discriminator 3))
[ 13.042302][ T0] percpu_ref_switch_to_managed (include/linux/spinlock.h:406 lib/percpu-refcount.c:182)
[ 13.042302][ T0] aa_alloc_root_ns (security/apparmor/policy_ns.c:383)
[ 13.042302][ T0] ? aa_setup_dfa_engine (security/apparmor/lsm.c:2194)
[ 13.042302][ T0] apparmor_init (security/apparmor/lsm.c:2235)
[ 13.042302][ T0] initialize_lsm (security/security.c:263 (discriminator 3))
[ 13.042302][ T0] ordered_lsm_init (security/security.c:422 (discriminator 3))
[ 13.042302][ T0] security_init (security/security.c:475)
[ 13.042302][ T0] start_kernel (init/main.c:1085)
[ 13.042302][ T0] x86_64_start_reservations (arch/x86/kernel/head64.c:495)
[ 13.042302][ T0] x86_64_start_kernel (arch/x86/kernel/head64.c:437 (discriminator 17))
[ 13.042302][ T0] common_startup_64 (arch/x86/kernel/head_64.S:421)
[ 13.042302][ T0] </TASK>
[ 13.042302][ T0] ---[ end trace 0000000000000000 ]---
[ 13.044823][ T0] AppArmor: AppArmor initialized
[ 13.062176][ T0] Mount-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
[ 13.065846][ T0] Mountpoint-cache hash table entries: 32768 (order: 6, 262144 bytes, linear)
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20240918/202409181358.c07681be-oliver.sang@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-09-18 5:44 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-09-16 5:08 [RFC 0/6] Managed Percpu Refcount Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 1/6] percpu-refcount: Add managed mode for RCU released objects Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 2/6] percpu-refcount: Add torture test for percpu refcount Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 3/6] percpu-refcount: Extend managed mode to allow runtime switching Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 4/6] percpu-refcount-torture: Extend test with runtime mode switches Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 5/6] apparmor: Switch labels to percpu refcount in atomic mode Neeraj Upadhyay
2024-09-16 5:08 ` [RFC 6/6] apparmor: Switch labels to percpu ref managed mode Neeraj Upadhyay
2024-09-18 5:44 ` kernel test robot