From: JP Kobryn <inwardvessel@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
hannes@cmpxchg.org, akpm@linux-foundation.org,
rostedt@goodmis.org, linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: Re: [PATCH 2/2] memcg: use memcg flush tracepoint
Date: Fri, 25 Oct 2024 11:26:54 -0700
Message-ID: <8c44b4d3-0083-42b6-8006-adcfaee04e19@gmail.com>
In-Reply-To: <CAJD7tkZ=YB66T4-j-qKxzYktiKqmfufg_eTjHo_6W7eQjqSXmg@mail.gmail.com>
On 10/25/24 10:53 AM, Yosry Ahmed wrote:
> On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> On 10/25/24 12:40 AM, Yosry Ahmed wrote:
>>> On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
>>>>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>>> Make use of the flush tracepoint within memcontrol.
>>>>>>
>>>>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>>>>> Is the intention to use tools like bpftrace to analyze where we flush
>>>>> the most? In this case, why can't we just attach to the fentry of
>>>>> do_flush_stats() and use the stack trace to find the path?
>>>>>
>>>>> We can also attach to mem_cgroup_flush_stats(), and the difference in
>>>>> counts between the two will be the number of skipped flushes.
>>>>>
>>>> All these functions can get inlined, and then we cannot really attach
>>>> easily. We could find the offsets of the inlined call sites and use
>>>> kprobes, but that is prohibitive when it has to be done for multiple
>>>> kernels built with FDO/BOLT.
>>>>
>>>> Please note that tracepoints are not really API, so we can remove them
>>>> in future if we see no usage for them.
>>> That's fair, but can we just add two tracepoints? This seems enough to
>>> collect necessary data, and prevent proliferation of tracepoints and
>>> the addition of the enum.
>>>
>>> I am thinking one in mem_cgroup_flush_stats() and one in
>>> do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
>>> trace_do_flush_stats(). Although the name of the latter is too
>>> generic, maybe we should rename the function first to add mem_cgroup_*
>>> or memcg_*.
>>>
>>> WDYT?
>> Hmmm, I think if we did that we wouldn't get accurate info on when the
>> flush was skipped. Comparing the hit counts of mem_cgroup_flush_stats()
>> and do_flush_stats() to derive the number of skips doesn't seem
>> reliable, because do_flush_stats() is also called from outside
>> mem_cgroup_flush_stats(). A skip could occur, but meanwhile each call
>> to do_flush_stats() from outside mem_cgroup_flush_stats() would
>> effectively subtract that skip, making it appear that no skip occurred.
> You're underestimating the power of BPF, my friend :) We can count the
> number of flushes in task local storages, in which case we can get a
> very accurate representation because the counters are per-task, so we
> know exactly when we skipped, but..
Interesting. Thanks for explaining.
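For anyone following along, an untested sketch of that stack-trace approach might look like the following. This assumes a kernel with BTF and that the two symbols survive inlining on the running build, which is exactly Shakeel's concern above; the map names are illustrative:

```
// bpftrace sketch, untested: count callers of each function by kernel
// stack. The difference between the two totals would approximate the
// number of skipped flushes (modulo the caveat discussed below).
fentry:mem_cgroup_flush_stats { @calls[kstack] = count(); }
fentry:do_flush_stats        { @flushes[kstack] = count(); }
```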
>
>> Maybe as a middle ground we could remove the trace calls for the zswap
>> and periodic cases, since no skips can occur there. We could then just
>> leave one trace call in mem_cgroup_flush_stats() and instead of an enum
>> we can pass a bool saying skipped or not. Something like this:
>>
>> mem_cgroup_flush_stats()
>> {
>>         bool needs_flush = memcg_vmstats_needs_flush(...);
>>
>>         trace_memcg_flush_stats(memcg, needs_flush);
>>         if (needs_flush)
>>                 do_flush_stats(...);
>> }
>>
>>
>> Yosry/Shakeel, do you have any thoughts on whether we should keep the
>> trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?
> ..with that being said, I do like having a single tracepoint. I think
> with some refactoring we can end up with a single tracepoint and more
> data. We can even capture the cases where we force a flush but we
> don't really need to flush. We can even add vmstats->stats_updates to
> the tracepoint to know exactly how many updates we have when we flush.
>
> What about the following:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c570..be0e7f52ad11a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> }
> }
>
> -static void do_flush_stats(struct mem_cgroup *memcg)
> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
> {
> + bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
> +
> + trace_memcg_flush_stats(memcg, needs_flush, force, ...);
> + if (!force && !needs_flush)
> + return;
> +
> if (mem_cgroup_is_root(memcg))
> WRITE_ONCE(flush_last_time, jiffies_64);
>
> @@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> if (!memcg)
> memcg = root_mem_cgroup;
>
> - if (memcg_vmstats_needs_flush(memcg->vmstats))
> - do_flush_stats(memcg);
> + __mem_cgroup_flush_stats(memcg, false);
> }
>
> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
> @@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
> * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
> * in latency-sensitive paths is as cheap as possible.
> */
> - do_flush_stats(root_mem_cgroup);
> + __mem_cgroup_flush_stats(root_mem_cgroup, true);
> queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
> }
>
> @@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> break;
> }
>
> - /*
> - * mem_cgroup_flush_stats() ignores small changes. Use
> - * do_flush_stats() directly to get accurate stats for charging.
> - */
> - do_flush_stats(memcg);
> + /* Force a flush to get accurate stats for charging */
> + __mem_cgroup_flush_stats(memcg, true);
> pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> if (pages < max)
> continue;
I like the additional info. I'll incorporate this into v2.
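For v2, the tracepoint definition could look roughly like this. A sketch only: the header location, field names, and printk format are illustrative and not from the patch:

```c
/* Hypothetical sketch for include/trace/events/memcg.h */
TRACE_EVENT(memcg_flush_stats,

	TP_PROTO(struct mem_cgroup *memcg, u64 stats_updates,
		 bool force, bool needs_flush),

	TP_ARGS(memcg, stats_updates, force, needs_flush),

	TP_STRUCT__entry(
		__field(u64, cgroup_id)
		__field(u64, stats_updates)
		__field(bool, force)
		__field(bool, needs_flush)
	),

	TP_fast_assign(
		__entry->cgroup_id = cgroup_id(memcg->css.cgroup);
		__entry->stats_updates = stats_updates;
		__entry->force = force;
		__entry->needs_flush = needs_flush;
	),

	TP_printk("cgroup_id=%llu stats_updates=%llu force=%d needs_flush=%d",
		  __entry->cgroup_id, __entry->stats_updates,
		  __entry->force, __entry->needs_flush)
);
```

With force and needs_flush both recorded, a single tracepoint distinguishes skipped flushes (!force && !needs_flush) from forced-but-unneeded ones (force && !needs_flush) without any counter arithmetic.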
Thread overview: 9+ messages
2024-10-25 0:25 [PATCH 0/2] memcg: tracepoint for flushing stats JP Kobryn
2024-10-25 0:25 ` [PATCH 1/2] memcg: add memcg flush tracepoint event JP Kobryn
2024-10-25 0:25 ` [PATCH 2/2] memcg: use memcg flush tracepoint JP Kobryn
2024-10-25 0:57 ` Yosry Ahmed
2024-10-25 1:15 ` Shakeel Butt
2024-10-25 7:40 ` Yosry Ahmed
2024-10-25 17:04 ` JP Kobryn
2024-10-25 17:53 ` Yosry Ahmed
2024-10-25 18:26 ` JP Kobryn [this message]