On 5/1/24 13:22, Jesper Dangaard Brouer wrote:
>
>
> On 01/05/2024 16.24, Waiman Long wrote:
>> On 5/1/24 10:04, Jesper Dangaard Brouer wrote:
>>> This closely resembles helpers added for the global cgroup_rstat_lock
>>> in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers
>>> and tracepoints"). This is for the per-CPU lock cgroup_rstat_cpu_lock.
>>>
>>> Based on production workloads, we observe the fast-path "update"
>>> function cgroup_rstat_updated() is invoked around 3 million times per
>>> sec, while the "flush" function cgroup_rstat_flush_locked(), walking
>>> each possible CPU, can see periodic spikes of 700 invocations/sec.
>>>
>>> For this reason, the tracepoints are split into normal and fastpath
>>> versions for this per-CPU lock, making it feasible for production to
>>> continuously monitor the non-fastpath tracepoint to detect lock
>>> contention issues. The reason for monitoring is that the lock disables
>>> IRQs, which can disturb e.g. softirq processing on the local CPUs
>>> involved. When the global cgroup_rstat_lock stops disabling IRQs
>>> (e.g. converted to a mutex), this per-CPU lock becomes the next
>>> bottleneck that can introduce latency variations.
>>>
>>> A practical bpftrace script for monitoring contention latency:
>>>
>>>   bpftrace -e '
>>>     tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>>>       @start[tid]=nsecs; @cnt[probe]=count()}
>>>     tracepoint:cgroup:cgroup_rstat_cpu_locked {
>>>       if (args->contended) {
>>>         @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
>>>       @cnt[probe]=count()}
>>>     interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}'
>>
>> This is a per-cpu lock, so the only possible contention involves just
>> two CPUs: a local CPU invoking cgroup_rstat_updated(), and a flusher
>> CPU doing cgroup_rstat_flush_locked() calling into
>> cgroup_rstat_updated_list(). With recent commits to reduce the percpu
>> lock hold time, I doubt lock contention on the percpu lock will have
>> a great impact on latency.
>
> I do appreciate your recent changes to reduce the percpu lock hold time.
> These tracepoints allow me to measure and differentiate the percpu lock
> hold time vs. the flush time.
>
> In production (using [1]) I'm seeing "Long lock-hold time" [L100], e.g.
> up to 29 ms, which is time spent after obtaining the lock (runtime under
> the lock). I was expecting instead to see "High Lock-contention wait"
> [L82], which is the time spent waiting to obtain the lock.
>
> This is why I'm adding these tracepoints: they allow me to dig deeper,
> to understand where these high runtime variations originate.
>
>
> Data:
>
>  16:52:09 Long lock-hold time: 14950 usec (14 ms) on CPU:34 comm:kswapd4
>  16:52:09 Long lock-hold time: 14821 usec (14 ms) on CPU:34 comm:kswapd4
>  16:52:09 Long lock-hold time: 11299 usec (11 ms) on CPU:98 comm:kswapd4
>  16:52:09 Long lock-hold time: 17237 usec (17 ms) on CPU:113 comm:kswapd6
>  16:52:09 Long lock-hold time: 29000 usec (29 ms) on CPU:36 comm:kworker/u261:12

That lock hold time is much higher than I would have expected.

>  16:52:09 time elapsed: 80 sec (interval = 1 sec)
>   Flushes(5033) 294/interval (avg 62/sec)
>   Locks(53374) 1748/interval (avg 667/sec)
>   Yields(48341) 1454/interval (avg 604/sec)
>   Contended(48104) 1450/interval (avg 601/sec)
>
>
>> So do we really need such an elaborate scheme to monitor this? BTW,
>> the additional code will also add to the worst case latency.
>
> Hmm, I designed this code to have minimal impact, as tracepoints are
> no-ops until activated. I really doubt this code will change the
> latency.
>
>
> [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
>
> [L100]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L100
>
> [L82]
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L82
>
>>>
>>> Signed-off-by: Jesper Dangaard Brouer
>
> More data: the histogram of time spent under the lock has some strange
> variation issues, with a group in the 4 ms to 65 ms area. I am
> investigating what can be causing this... the next step depends on
> these tracepoints. (A sketch of a script that can collect such a
> hold-time histogram is included at the end of this mail.)
>
> @lock_cnt: 759146
>
> @locked_ns:
> [1K, 2K)             499 |                                                    |
> [2K, 4K)          206928 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [4K, 8K)          147904 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
> [8K, 16K)          64453 |@@@@@@@@@@@@@@@@                                    |
> [16K, 32K)        135467 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
> [32K, 64K)         75943 |@@@@@@@@@@@@@@@@@@@                                 |
> [64K, 128K)        38359 |@@@@@@@@@                                           |
> [128K, 256K)       46597 |@@@@@@@@@@@                                         |
> [256K, 512K)       32466 |@@@@@@@@                                            |
> [512K, 1M)          3945 |                                                    |
> [1M, 2M)             642 |                                                    |
> [2M, 4M)             750 |                                                    |
> [4M, 8M)            1932 |                                                    |
> [8M, 16M)           2114 |                                                    |
> [16M, 32M)          1039 |                                                    |
> [32M, 64M)           108 |                                                    |
>
>>> ---
>>>   include/trace/events/cgroup.h |   56 +++++++++++++++++++++++++++++----
>>>   kernel/cgroup/rstat.c         |   70 ++++++++++++++++++++++++++++++++++-------
>>>   2 files changed, 108 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
>>> index 13f375800135..0b95865a90f3 100644
>>> --- a/include/trace/events/cgroup.h
>>> +++ b/include/trace/events/cgroup.h
>>> @@ -206,15 +206,15 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
>>>   DECLARE_EVENT_CLASS(cgroup_rstat,
>>>
>>> -    TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>>
>>> -    TP_ARGS(cgrp, cpu_in_loop, contended),
>>> +    TP_ARGS(cgrp, cpu, contended),
>>>
>>>       TP_STRUCT__entry(
>>>           __field(    int,        root            )
>>>           __field(    int,        level            )
>>>           __field(    u64,        id            )
>>> -        __field(    int,        cpu_in_loop        )
>>> +        __field(    int,        cpu            )
>>>           __field(    bool,        contended        )
>>>       ),
>>>
>>> @@ -222,15 +222,16 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
>>>           __entry->root = cgrp->root->hierarchy_id;
>>>           __entry->id = cgroup_id(cgrp);
>>>           __entry->level = cgrp->level;
>>> -        __entry->cpu_in_loop = cpu_in_loop;
>>> +        __entry->cpu = cpu;
>>>           __entry->contended = contended;
>>>       ),
>>>
>>> -    TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
>>> +    TP_printk("root=%d id=%llu level=%d cpu=%d lock contended:%d",
>>>             __entry->root, __entry->id, __entry->level,
>>> -          __entry->cpu_in_loop, __entry->contended)
>>> +          __entry->cpu, __entry->contended)
>>>   );
>>>
>>> +/* Related to global: cgroup_rstat_lock */
>>>   DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
>>>
>>>       TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>>
>>> @@ -252,6 +253,49 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
>>>       TP_ARGS(cgrp, cpu, contended)
>>>   );
>>>
>>> +/* Related to per CPU: cgroup_rstat_cpu_lock */
>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
>>> +
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>> +
>>> +    TP_ARGS(cgrp, cpu, contended)
>>> +);
>>> +
>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath,
>>> +
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>> +
>>> +    TP_ARGS(cgrp, cpu, contended)
>>> +);
>>> +
>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked,
>>> +
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>> +
>>> +    TP_ARGS(cgrp, cpu, contended)
>>> +);
>>> +
>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath,
>>> +
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>> +
>>> +    TP_ARGS(cgrp, cpu, contended)
>>> +);
>>> +
>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock,
>>> +
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>> +
>>> +    TP_ARGS(cgrp, cpu, contended)
>>> +);
>>> +
>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
>>> +
>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>> +
>>> +    TP_ARGS(cgrp, cpu, contended)
>>> +);
>>> +
>>>   #endif /* _TRACE_CGROUP_H */
>>>
>>>   /* This part must be outside protection */
>>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>>> index 52e3b0ed1cee..fb8b49437573 100644
>>> --- a/kernel/cgroup/rstat.c
>>> +++ b/kernel/cgroup/rstat.c
>>> @@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>>>       return per_cpu_ptr(cgrp->rstat_cpu, cpu);
>>>   }
>>>
>>> +/*
>>> + * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>>> + *
>>> + * This makes it easier to diagnose locking issues and contention in
>>> + * production environments. The parameter @fast_path determines the
>>> + * tracepoints being added, allowing us to diagnose "flush" related
>>> + * operations without handling high-frequency fast-path "update" events.
>>> + */
>>> +static __always_inline
>>> +unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>>> +                     struct cgroup *cgrp, const bool fast_path)
>>> +{
>>> +    unsigned long flags;
>>> +    bool contended;
>>> +
>>> +    /*
>>> +     * The _irqsave() is needed because cgroup_rstat_lock is
>>> +     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>>> +     * this lock with the _irq() suffix only disables interrupts on
>>> +     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>>> +     * interrupts on both configurations. The _irqsave() ensures
>>> +     * that interrupts are always disabled and later restored.
>>> +     */
>>> +    contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>>> +    if (contended) {
>>> +        if (fast_path)
>>> +            trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
>>> +        else
>>> +            trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
>>> +
>>> +        raw_spin_lock_irqsave(cpu_lock, flags);

Could you do a local_irq_save() before calling trace_cgroup*() and
raw_spin_lock()? Would that help in eliminating this high lock hold
time?

You can also do a local_irq_save() first before the trylock. That will
eliminate the duplicated irq_restore() and irq_save() when there is
contention. If not, there may be NMIs mixed in.
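A rough sketch of that reordering, untested and only to illustrate the
suggestion (raw_spin_trylock_irqsave() is just local_irq_save() plus
raw_spin_trylock(), so splitting it out avoids the restore/save pair on
the contended path):

    static __always_inline
    unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
                         struct cgroup *cgrp, const bool fast_path)
    {
        unsigned long flags;
        bool contended;

        /*
         * Save and disable IRQs once, before the trylock, so a failed
         * trylock does not go through an extra irq_restore()/irq_save()
         * pair inside raw_spin_lock_irqsave().
         */
        local_irq_save(flags);
        contended = !raw_spin_trylock(cpu_lock);
        if (contended) {
            /* Tracepoints now fire with IRQs already disabled */
            if (fast_path)
                trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
            else
                trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);

            raw_spin_lock(cpu_lock);
        }

        if (fast_path)
            trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
        else
            trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);

        return flags;
    }

The matching _cgroup_rstat_cpu_unlock() would then do raw_spin_unlock()
followed by local_irq_restore(flags).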
Thanks,
Longman

>>> +    }
>>> +
>>> +    if (fast_path)
>>> +        trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
>>> +    else
>>> +        trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);
>>> +
>>> +    return flags;
>>> +}
>>> +
>>> +static __always_inline
>>> +void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
>>> +                  struct cgroup *cgrp, unsigned long flags,
>>> +                  const bool fast_path)
>>> +{
>>> +    if (fast_path)
>>> +        trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
>>> +    else
>>> +        trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
>>> +
>>> +    raw_spin_unlock_irqrestore(cpu_lock, flags);
>>> +}
>>> +
>>>   /**
>>>    * cgroup_rstat_updated - keep track of updated rstat_cpu
>>>    * @cgrp: target cgroup
>>> @@ -44,7 +98,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>>>       if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
>>>           return;
>>>
>>> -    raw_spin_lock_irqsave(cpu_lock, flags);
>>> +    flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
>>>
>>>       /* put @cgrp and all ancestors on the corresponding updated lists */
>>>       while (true) {
>>> @@ -72,7 +126,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>>>           cgrp = parent;
>>>       }
>>>
>>> -    raw_spin_unlock_irqrestore(cpu_lock, flags);
>>> +    _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
>>>   }
>>>
>>>   /**
>>> @@ -153,15 +207,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
>>>       struct cgroup *head = NULL, *parent, *child;
>>>       unsigned long flags;
>>>
>>> -    /*
>>> -     * The _irqsave() is needed because cgroup_rstat_lock is
>>> -     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>>> -     * this lock with the _irq() suffix only disables interrupts on
>>> -     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>>> -     * interrupts on both configurations. The _irqsave() ensures
>>> -     * that interrupts are always disabled and later restored.
>>> -     */
>>> -    raw_spin_lock_irqsave(cpu_lock, flags);
>>> +    flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
>>>
>>>       /* Return NULL if this subtree is not on-list */
>>>       if (!rstatc->updated_next)
>>> @@ -198,7 +244,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
>>>       if (child != root)
>>>           head = cgroup_rstat_push_children(head, child, cpu);
>>>
>>>   unlock_ret:
>>> -    raw_spin_unlock_irqrestore(cpu_lock, flags);
>>> +    _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
>>>       return head;
>>>   }
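As referenced above, a minimal bpftrace sketch for collecting the lock
hold-time histogram (@lock_cnt/@locked_ns) shown earlier, assuming the
new per-CPU tracepoints from this patch (the map name @ts is
illustrative):

  bpftrace -e '
    tracepoint:cgroup:cgroup_rstat_cpu_locked { @ts[tid] = nsecs; }
    tracepoint:cgroup:cgroup_rstat_cpu_unlock {
      if (@ts[tid]) {
        @lock_cnt = count();
        @locked_ns = hist(nsecs - @ts[tid]);
        delete(@ts[tid]);}}'

Keying on tid pairs each unlock with the locked event from the same
thread, in the same style as the contention script quoted at the top.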