From: JP Kobryn <inwardvessel@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>,
hannes@cmpxchg.org, akpm@linux-foundation.org,
rostedt@goodmis.org, linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: Re: [PATCH 2/2] memcg: use memcg flush tracepoint
Date: Fri, 25 Oct 2024 11:26:54 -0700
Message-ID: <8c44b4d3-0083-42b6-8006-adcfaee04e19@gmail.com>
In-Reply-To: <CAJD7tkZ=YB66T4-j-qKxzYktiKqmfufg_eTjHo_6W7eQjqSXmg@mail.gmail.com>
On 10/25/24 10:53 AM, Yosry Ahmed wrote:
> On Fri, Oct 25, 2024 at 10:05 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> On 10/25/24 12:40 AM, Yosry Ahmed wrote:
>>> On Thu, Oct 24, 2024 at 6:16 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>>> On Thu, Oct 24, 2024 at 05:57:25PM GMT, Yosry Ahmed wrote:
>>>>> On Thu, Oct 24, 2024 at 5:26 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>>>>> Make use of the flush tracepoint within memcontrol.
>>>>>>
>>>>>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>>>>> Is the intention to use tools like bpftrace to analyze where we flush
>>>>> the most? In this case, why can't we just attach to the fentry of
>>>>> do_flush_stats() and use the stack trace to find the path?
>>>>>
>>>>> We can also attach to mem_cgroup_flush_stats(), and the difference in
>>>>> counts between the two will be the number of skipped flushes.
>>>>>
>>>> All these functions can get inlined, and then we cannot really attach
>>>> easily. We could find the offsets of the inlined call sites and use
>>>> kprobes, but that is prohibitive when it has to be done for multiple
>>>> kernels built with FDO/BOLT.
>>>>
>>>> Please note that tracepoints are not really API, so we can remove them
>>>> in future if we see no usage for them.
>>> That's fair, but can we just add two tracepoints? This seems enough to
>>> collect necessary data, and prevent proliferation of tracepoints and
>>> the addition of the enum.
>>>
>>> I am thinking one in mem_cgroup_flush_stats() and one in
>>> do_flush_stats(), e.g. trace_mem_cgroup_flush_stats() and
>>> trace_do_flush_stats(). Although the name of the latter is too
>>> generic, maybe we should rename the function first to add mem_cgroup_*
>>> or memcg_*.
>>>
>>> WDYT?
>> Hmmm, I think if we did that we wouldn't get accurate info on when the
>> flush was skipped. Comparing the hit counts of mem_cgroup_flush_stats()
>> and do_flush_stats() to derive the number of skips doesn't seem
>> reliable, because do_flush_stats() is also called from outside
>> mem_cgroup_flush_stats(). A skip could occur, but meanwhile each call
>> to do_flush_stats() from outside mem_cgroup_flush_stats() would
>> effectively subtract that skip, making it appear that no skip occurred.
> You're underestimating the power of BPF, my friend :) We can count the
> number of flushes in task local storages, in which case we can get a
> very accurate representation because the counters are per-task, so we
> know exactly when we skipped, but..
Interesting. Thanks for explaining.
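For anyone following along, an untested sketch of that stack-trace approach might look like the following. This assumes a kernel with BTF and that the two symbols survive inlining on the running build, which is exactly Shakeel's concern above; the map names are illustrative:

```
// bpftrace sketch, untested: count callers of each function by kernel
// stack. The difference between the two totals would approximate the
// number of skipped flushes (modulo the caveat discussed below).
fentry:mem_cgroup_flush_stats { @calls[kstack] = count(); }
fentry:do_flush_stats        { @flushes[kstack] = count(); }
```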
>
>> Maybe as a middle ground we could remove the trace calls for the zswap
>> and periodic cases, since no skips can occur there. We could then just
>> leave one trace call in mem_cgroup_flush_stats() and instead of an enum
>> we can pass a bool saying skipped or not. Something like this:
>>
>> mem_cgroup_flush_stats()
>> {
>>         bool needs_flush = memcg_vmstats_needs_flush(...);
>>
>>         trace_memcg_flush_stats(memcg, needs_flush);
>>         if (needs_flush)
>>                 do_flush_stats(...);
>> }
>>
>>
>> Yosry/Shakeel, do you have any thoughts on whether we should keep the
>> trace calls for obj_cgroup_may_zswap() and periodic workqueue cases?
> ..with that being said, I do like having a single tracepoint. I think
> with some refactoring we can end up with a single tracepoint and more
> data. We can even capture the cases where we force a flush but we
> don't really need to flush. We can even add vmstats->stats_updates to
> the tracepoint to know exactly how many updates we have when we flush.
>
> What about the following:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7845c64a2c570..be0e7f52ad11a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -584,8 +584,14 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> }
> }
>
> -static void do_flush_stats(struct mem_cgroup *memcg)
> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
> {
> + bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
> +
> + trace_memcg_flush_stats(memcg, needs_flush, force, ...);
> + if (!force && !needs_flush)
> + return;
> +
> if (mem_cgroup_is_root(memcg))
> WRITE_ONCE(flush_last_time, jiffies_64);
>
> @@ -609,8 +615,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> if (!memcg)
> memcg = root_mem_cgroup;
>
> - if (memcg_vmstats_needs_flush(memcg->vmstats))
> - do_flush_stats(memcg);
> + __mem_cgroup_flush_stats(memcg, false);
> }
>
> void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
> @@ -626,7 +631,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
> * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
> * in latency-sensitive paths is as cheap as possible.
> */
> - do_flush_stats(root_mem_cgroup);
> + __mem_cgroup_flush_stats(root_mem_cgroup, true);
> queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
> }
>
> @@ -5272,11 +5277,8 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> break;
> }
>
> - /*
> - * mem_cgroup_flush_stats() ignores small changes. Use
> - * do_flush_stats() directly to get accurate stats for charging.
> - */
> - do_flush_stats(memcg);
> + /* Force a flush to get accurate stats for charging */
> + __mem_cgroup_flush_stats(memcg, true);
> pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> if (pages < max)
> continue;
I like the additional info. I'll incorporate this into v2.
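For v2, the tracepoint definition could look roughly like this. A sketch only: the header location, field names, and printk format are illustrative and not from the patch:

```c
/* Hypothetical sketch for include/trace/events/memcg.h */
TRACE_EVENT(memcg_flush_stats,

	TP_PROTO(struct mem_cgroup *memcg, u64 stats_updates,
		 bool force, bool needs_flush),

	TP_ARGS(memcg, stats_updates, force, needs_flush),

	TP_STRUCT__entry(
		__field(u64, cgroup_id)
		__field(u64, stats_updates)
		__field(bool, force)
		__field(bool, needs_flush)
	),

	TP_fast_assign(
		__entry->cgroup_id = cgroup_id(memcg->css.cgroup);
		__entry->stats_updates = stats_updates;
		__entry->force = force;
		__entry->needs_flush = needs_flush;
	),

	TP_printk("cgroup_id=%llu stats_updates=%llu force=%d needs_flush=%d",
		  __entry->cgroup_id, __entry->stats_updates,
		  __entry->force, __entry->needs_flush)
);
```

With force and needs_flush both recorded, a single tracepoint distinguishes skipped flushes (!force && !needs_flush) from forced-but-unneeded ones (force && !needs_flush) without any counter arithmetic.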
Thread overview: 9+ messages
2024-10-25 0:25 [PATCH 0/2] memcg: tracepoint for flushing stats JP Kobryn
2024-10-25 0:25 ` [PATCH 1/2] memcg: add memcg flush tracepoint event JP Kobryn
2024-10-25 0:25 ` [PATCH 2/2] memcg: use memcg flush tracepoint JP Kobryn
2024-10-25 0:57 ` Yosry Ahmed
2024-10-25 1:15 ` Shakeel Butt
2024-10-25 7:40 ` Yosry Ahmed
2024-10-25 17:04 ` JP Kobryn
2024-10-25 17:53 ` Yosry Ahmed
2024-10-25 18:26 ` JP Kobryn [this message]