Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
From: Waiman Long <longman@redhat.com>
Date: Wed, 1 May 2024 14:41:49 -0400
To: Jesper Dangaard Brouer, tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, yosryahmed@google.com
Cc: netdev@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, kernel-team@cloudflare.com, Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior
Message-ID: <203fdb35-f4cf-4754-9709-3c024eecade9@redhat.com>
In-Reply-To: <4a680b80-b296-4466-895a-13239b982c85@kernel.org>
References: <171457225108.4159924.12821205549807669839.stgit@firesoul> <30d64e25-561a-41c6-ab95-f0820248e9b6@redhat.com> <4a680b80-b296-4466-895a-13239b982c85@kernel.org>

On 5/1/24 13:22, Jesper Dangaard Brouer wrote:


On 01/05/2024 16.24, Waiman Long wrote:
On 5/1/24 10:04, Jesper Dangaard Brouer wrote:
This closely resembles helpers added for the global cgroup_rstat_lock in
commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and
tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.

Based on production workloads, we observe the fast-path "update" function
cgroup_rstat_updated() is invoked around 3 million times per sec, while the
"flush" function cgroup_rstat_flush_locked(), walking each possible CPU,
can see periodic spikes of 700 invocations/sec.

For this reason, the tracepoints are split into normal and fastpath
versions for this per-CPU lock. This makes it feasible for production to
continuously monitor the non-fastpath tracepoint to detect lock contention
issues. The reason for monitoring is that the lock disables IRQs, which can
disturb e.g. softirq processing on the local CPUs involved. When the
global cgroup_rstat_lock stops disabling IRQs (e.g. converted to a mutex),
this per CPU lock becomes the next bottleneck that can introduce latency
variations.

A practical bpftrace script for monitoring contention latency:

  bpftrace -e '
    tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
      @start[tid]=nsecs; @cnt[probe]=count()}
    tracepoint:cgroup:cgroup_rstat_cpu_locked {
      if (args->contended) {
        @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
      @cnt[probe]=count()}
    interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt); clear(@cnt);}'

This is a per-cpu lock, so any contention involves only two CPUs: a local CPU invoking cgroup_rstat_updated(), and a flusher CPU doing cgroup_rstat_flush_locked() calling into cgroup_rstat_updated_list(). With recent commits to reduce the percpu lock hold time, I doubt lock contention on the percpu lock will have a great impact on latency.

I do appreciate your recent changes to reduce the percpu lock hold time.
These tracepoints allow me to measure and differentiate the percpu lock
hold time vs. the flush time.

In production (using [1]) I'm seeing "Long lock-hold time" [L100] of up to
29 ms, which is the time spent after obtaining the lock (runtime under the
lock).  I was expecting to see "High Lock-contention wait" [L82] instead,
which is the time spent waiting to obtain the lock.

This is why I'm adding these tracepoints: they allow me to dig deeper and
understand where these high runtime variations originate.
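
For example, a minimal (untested) sketch, separate from [1], that measures
the hold time between the locked and unlock tracepoints added by this patch:

  bpftrace -e '
    tracepoint:cgroup:cgroup_rstat_cpu_locked { @locked[tid]=nsecs; }
    tracepoint:cgroup:cgroup_rstat_cpu_unlock {
      if (@locked[tid]) {
        @hold_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);}
    }
    interval:s:1 {time("%H:%M:%S "); print(@hold_ns);}'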


Data:

 16:52:09 Long lock-hold time: 14950 usec (14 ms) on CPU:34 comm:kswapd4
 16:52:09 Long lock-hold time: 14821 usec (14 ms) on CPU:34 comm:kswapd4
 16:52:09 Long lock-hold time: 11299 usec (11 ms) on CPU:98 comm:kswapd4
 16:52:09 Long lock-hold time: 17237 usec (17 ms) on CPU:113 comm:kswapd6
 16:52:09 Long lock-hold time: 29000 usec (29 ms) on CPU:36 comm:kworker/u261:12
That lock hold time is much higher than I would have expected.
 16:52:09  time elapsed: 80 sec (interval = 1 sec)
  Flushes(5033) 294/interval (avg 62/sec)
  Locks(53374) 1748/interval (avg 667/sec)
  Yields(48341) 1454/interval (avg 604/sec)
  Contended(48104) 1450/interval (avg 601/sec)


So do we really need such an elaborate scheme to monitor this? BTW, the additional code will also add to the worst-case latency.

Hmm, I designed this code to have minimal impact, as tracepoints are
no-ops until activated.  I really doubt this code will change the latency.


[1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt

[L100] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L100

[L82] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L82


Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

More data: the histogram of time spent under the lock shows some strange
variation, with a group in the 4 ms to 65 ms area. I am investigating what
can be causing this... and the next step depends on these tracepoints.

@lock_cnt: 759146

@locked_ns:
[1K, 2K)             499 |                                                    |
[2K, 4K)          206928 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)          147904 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8K, 16K)          64453 |@@@@@@@@@@@@@@@@                                    |
[16K, 32K)        135467 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[32K, 64K)         75943 |@@@@@@@@@@@@@@@@@@@                                 |
[64K, 128K)        38359 |@@@@@@@@@                                           |
[128K, 256K)       46597 |@@@@@@@@@@@                                         |
[256K, 512K)       32466 |@@@@@@@@                                            |
[512K, 1M)          3945 |                                                    |
[1M, 2M)             642 |                                                    |
[2M, 4M)             750 |                                                    |
[4M, 8M)            1932 |                                                    |
[8M, 16M)           2114 |                                                    |
[16M, 32M)          1039 |                                                    |
[32M, 64M)           108 |                                                    |




---
  include/trace/events/cgroup.h |   56 +++++++++++++++++++++++++++++----
  kernel/cgroup/rstat.c         |   70 ++++++++++++++++++++++++++++++++++-------
  2 files changed, 108 insertions(+), 18 deletions(-)

diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index 13f375800135..0b95865a90f3 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -206,15 +206,15 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
  DECLARE_EVENT_CLASS(cgroup_rstat,
-    TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
-    TP_ARGS(cgrp, cpu_in_loop, contended),
+    TP_ARGS(cgrp, cpu, contended),
      TP_STRUCT__entry(
          __field(    int,        root            )
          __field(    int,        level            )
          __field(    u64,        id            )
-        __field(    int,        cpu_in_loop        )
+        __field(    int,        cpu            )
          __field(    bool,        contended        )
      ),
@@ -222,15 +222,16 @@ DECLARE_EVENT_CLASS(cgroup_rstat,
          __entry->root = cgrp->root->hierarchy_id;
          __entry->id = cgroup_id(cgrp);
          __entry->level = cgrp->level;
-        __entry->cpu_in_loop = cpu_in_loop;
+        __entry->cpu = cpu;
          __entry->contended = contended;
      ),
-    TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
+    TP_printk("root=%d id=%llu level=%d cpu=%d lock contended:%d",
            __entry->root, __entry->id, __entry->level,
-          __entry->cpu_in_loop, __entry->contended)
+          __entry->cpu, __entry->contended)
  );
+/* Related to global: cgroup_rstat_lock */
  DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
      TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
@@ -252,6 +253,49 @@ DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
      TP_ARGS(cgrp, cpu, contended)
  );
+/* Related to per CPU: cgroup_rstat_cpu_lock */
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
+
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+    TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath,
+
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+    TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked,
+
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+    TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath,
+
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+    TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock,
+
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+    TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
+
+    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+    TP_ARGS(cgrp, cpu, contended)
+);
+
  #endif /* _TRACE_CGROUP_H */
  /* This part must be outside protection */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 52e3b0ed1cee..fb8b49437573 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
      return per_cpu_ptr(cgrp->rstat_cpu, cpu);
  }
+/*
+ * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments. The parameter @fast_path determines the
+ * tracepoints being added, allowing us to diagnose "flush" related
+ * operations without handling high-frequency fast-path "update" events.
+ */
+static __always_inline
+unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
+                     struct cgroup *cgrp, const bool fast_path)
+{
+    unsigned long flags;
+    bool contended;
+
+    /*
+     * The _irqsave() is needed because cgroup_rstat_lock is
+     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
+     * this lock with the _irq() suffix only disables interrupts on
+     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
+     * interrupts on both configurations. The _irqsave() ensures
+     * that interrupts are always disabled and later restored.
+     */
+    contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
+    if (contended) {
+        if (fast_path)
+            trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
+        else
+            trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
+
+        raw_spin_lock_irqsave(cpu_lock, flags);

Could you do a local_irq_save() before calling trace_cgroup*() and raw_spin_lock()? Would that help in eliminating this high lock hold time?

You can also do a local_irq_save() first before the trylock. That will eliminate the duplicated irq_restore() and irq_save() when there is contention.
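
A rough (untested) sketch of that restructuring, keeping the rest of
_cgroup_rstat_cpu_lock() unchanged:

	unsigned long flags;
	bool contended;

	/* Disable IRQs once up front; the trylock/lock then no longer need _irqsave */
	local_irq_save(flags);
	contended = !raw_spin_trylock(cpu_lock);
	if (contended) {
		if (fast_path)
			trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
		else
			trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);

		raw_spin_lock(cpu_lock);
	}

The unlock side would then do raw_spin_unlock(cpu_lock) followed by
local_irq_restore(flags).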

If not, there may be NMIs mixed in.

Thanks,
Longman

+    }
+
+    if (fast_path)
+        trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
+    else
+        trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);
+
+    return flags;
+}
+
+static __always_inline
+void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
+                  struct cgroup *cgrp, unsigned long flags,
+                  const bool fast_path)
+{
+    if (fast_path)
+        trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
+    else
+        trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
+
+    raw_spin_unlock_irqrestore(cpu_lock, flags);
+}
+
  /**
   * cgroup_rstat_updated - keep track of updated rstat_cpu
   * @cgrp: target cgroup
@@ -44,7 +98,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
      if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
          return;
-    raw_spin_lock_irqsave(cpu_lock, flags);
+    flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
      /* put @cgrp and all ancestors on the corresponding updated lists */
      while (true) {
@@ -72,7 +126,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
          cgrp = parent;
      }
-    raw_spin_unlock_irqrestore(cpu_lock, flags);
+    _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
  }
  /**
@@ -153,15 +207,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
      struct cgroup *head = NULL, *parent, *child;
      unsigned long flags;
-    /*
-     * The _irqsave() is needed because cgroup_rstat_lock is
-     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-     * this lock with the _irq() suffix only disables interrupts on
-     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-     * interrupts on both configurations. The _irqsave() ensures
-     * that interrupts are always disabled and later restored.
-     */
-    raw_spin_lock_irqsave(cpu_lock, flags);
+    flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
      /* Return NULL if this subtree is not on-list */
      if (!rstatc->updated_next)
@@ -198,7 +244,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
      if (child != root)
          head = cgroup_rstat_push_children(head, child, cpu);
  unlock_ret:
-    raw_spin_unlock_irqrestore(cpu_lock, flags);
+    _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
      return head;
  }



