From: David Finkel
Date: Tue, 23 Jul 2024 15:45:57 -0400
Subject: Re: [PATCH 1/2] mm, memcg: cg2 memory{.swap,}.peak write handlers
In-Reply-To: <20240723142950.GA389003@cmpxchg.org>
References: <20240722235554.2911971-1-davidf@vimeo.com> <20240722235554.2911971-2-davidf@vimeo.com> <20240723142950.GA389003@cmpxchg.org>
To: Johannes Weiner
Cc: Muchun Song, Tejun Heo, Roman Gushchin, Andrew Morton, core-services@vimeo.com, Jonathan Corbet, Michal Hocko, Shakeel Butt, Shuah Khan, Zefan Li, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Waiman Long
Hi Johannes,

Thanks for the thorough review!
I'll send a rebased version addressing your comments in a moment.

On Tue, Jul 23, 2024 at 10:29 AM Johannes Weiner wrote:
>
> Hi David,
>
> thanks for pursuing this! A couple of comments below.
>
> On Mon, Jul 22, 2024 at 07:55:53PM -0400, David Finkel wrote:
> > @@ -1322,11 +1322,16 @@ PAGE_SIZE multiple when read back.
> >         reclaim induced by memory.reclaim.
> >
> >   memory.peak
> > -       A read-only single value file which exists on non-root
> > -       cgroups.
> > +       A read-write single value file which exists on non-root cgroups.
> > +
> > +       The max memory usage recorded for the cgroup and its descendants since
> > +       either the creation of the cgroup or the most recent reset for that FD.
> >
> > -       The max memory usage recorded for the cgroup and its
> > -       descendants since the creation of the cgroup.
> > +       A write of the string "reset" to this file resets it to the
> > +       current memory usage for subsequent reads through the same
> > +       file descriptor.
> > +       Attempts to write any other non-empty string will return EINVAL
> > +       (modulo leading and trailing whitespace).
>
> Why not allow any write to reset? This makes it harder to use, and I'm
> not sure accidental writes are a likely mistake to make.

Accepting any string is a rather wide interface, and since actually empty
writes get dropped before this handler is invoked, I think allowing any
write is more confusing. I figured that a narrower interface is a better
starting point, as long as it's correctly documented. We can always widen
it to all writes later, but we can't go the other way.

> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 2150ca60394b..7001ed74e339 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -12,6 +12,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -855,4 +856,11 @@ static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
> >
> >  struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id);
> >
> > +struct memcg_peak_mem_ctx {
> > +       long local_watermark;
> > +       struct list_head peers;
> > +};
>
> Since this is generic cgroup code, and can be conceivably used by
> other controllers, let's keep the naming generic as well. How about:
>
> struct cgroup_of_peak {
>         long value;
>         struct list_head list;
> };
>
> cgroup-defs.h would be a better place for it.

That is much less jarring.
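To make the "only the literal string reset" rule above concrete, here is a small userspace sketch of the accept/reject check. It is not the kernel code: `strip()` is a stand-in for the kernel's `strstrip()`, and `peak_write_check()` is a hypothetical name for illustration only.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <string.h>

/* Userspace stand-in for the kernel's strstrip(): trim leading and
 * trailing whitespace in place and return the trimmed start. */
static char *strip(char *s)
{
	char *end;

	while (isspace((unsigned char)*s))
		s++;
	end = s + strlen(s);
	while (end > s && isspace((unsigned char)end[-1]))
		end--;
	*end = '\0';
	return s;
}

/* Mirrors the documented interface: only "reset", modulo leading and
 * trailing whitespace, is accepted; everything else is -EINVAL. */
static int peak_write_check(char *buf)
{
	buf = strip(buf);
	if (strcmp(buf, "reset"))
		return -EINVAL;
	return 0;
}
```

So `"  reset\n"` is accepted while `"clear"` or `"reset now"` fail, which is the narrow starting point argued for above.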
>
> > +struct memcg_peak_mem_ctx *memcg_extract_peak_mem_ctx(struct kernfs_open_file *of);
>
> of_peak()

> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 030d34e9d117..cbc390234605 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -198,6 +198,11 @@ struct mem_cgroup {
> >         struct page_counter kmem;       /* v1 only */
> >         struct page_counter tcpmem;     /* v1 only */
> >
> > +       /* lists of memcg peak watching contexts on swap and memory */
> > +       struct list_head peak_memory_local_watermark_watchers;
> > +       struct list_head peak_swap_local_watermark_watchers;
> > +       spinlock_t swap_memory_peak_watchers_lock;
>
> These names are too long. How about:
>
>         /* Registered local usage peak watchers */
>         struct list_head memory_peaks;
>         struct list_head swap_peaks;
>         spinlock_t peaks_lock;
>

That's much better.
(sometimes when I'm unsure about a name, I default to a terrible, long
name with way too much information)

> > diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
> > index 8cd858d912c4..06bb84218960 100644
> > --- a/include/linux/page_counter.h
> > +++ b/include/linux/page_counter.h
> > @@ -26,6 +26,7 @@ struct page_counter {
> >         atomic_long_t children_low_usage;
> >
> >         unsigned long watermark;
> > +       unsigned long local_watermark; /* track min of fd-local resets */
> >         unsigned long failcnt;
> >
> >         /* Keep all the read most fields in a separete cacheline. */
> > @@ -78,7 +79,15 @@ int page_counter_memparse(const char *buf, const char *max,
> >
> >  static inline void page_counter_reset_watermark(struct page_counter *counter)
> >  {
> > -       counter->watermark = page_counter_read(counter);
> > +       unsigned long cur = page_counter_read(counter);
>
> cur -> usage

That's a better name.
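The invariant relied on later in the thread (`watermark >= local_watermark`, because a global reset clears both) is easy to model outside the kernel. The following is a simplified userspace sketch, not the patch itself: field names follow the patch, but locking, `READ_ONCE`/`WRITE_ONCE`, and the counter hierarchy are all omitted.

```c
#include <assert.h>

/* Simplified model of a page_counter with two watermarks:
 * "watermark" is the global peak; "local_watermark" is the peak since
 * the most recent fd-local reset. */
struct counter {
	unsigned long usage;
	unsigned long watermark;       /* global peak */
	unsigned long local_watermark; /* resettable peak */
};

static void charge(struct counter *c, unsigned long nr)
{
	c->usage += nr;
	/* Because watermark >= local_watermark always holds, the outer
	 * check can gate the inner one (the "almost free" branch from
	 * the patch's comment). */
	if (c->usage > c->local_watermark) {
		c->local_watermark = c->usage;
		if (c->usage > c->watermark)
			c->watermark = c->usage;
	}
}

static void uncharge(struct counter *c, unsigned long nr)
{
	c->usage -= nr;
}

/* The local reset re-arms local_watermark at the current usage;
 * the global watermark is untouched. */
static void reset_local_watermark(struct counter *c)
{
	c->local_watermark = c->usage;
}
```

After a local reset, the global peak keeps reporting the all-time high while the local one starts over from current usage, which is exactly the split the `memory.peak` write handler exposes.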
>
> > @@ -6907,12 +6912,109 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
> >         return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
> >  }
> >
> > -static u64 memory_peak_read(struct cgroup_subsys_state *css,
> > -                           struct cftype *cft)
> > +inline int swap_memory_peak_show(
> > +       struct seq_file *sf, void *v, bool swap_cg)
> >  {
>
> Leave inlining to the compiler. Just static int.
>
> The name can be simply peak_show().

Oh, right, I forgot that since this is local to the translation unit
it doesn't need to be globally unique.

> Customary coding style is to line wrap at the last parameter that
> fits. Don't wrap if the line fits within 80 cols.
>
> static int peak_show(struct seq_file *sf, void *v, ...,
>                      ...)
> {
>         ...
> }
>

Oh yeah, I forgot to move those back to one line when I got rid of the
callback args while addressing Roman's comments. Thanks for pointing
that out.

> > +       struct cgroup_subsys_state *css = seq_css(sf);
> >         struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +       struct page_counter *pc;
> > +       struct kernfs_open_file *of = sf->private;
> > +       struct memcg_peak_mem_ctx *ctx = memcg_extract_peak_mem_ctx(of);
> > +       s64 fd_peak = ctx->local_watermark;
> >
> > -       return (u64)memcg->memory.watermark * PAGE_SIZE;
> > +       if (swap_cg)
> > +               pc = &memcg->swap;
> > +       else
> > +               pc = &memcg->memory;
> > +
> > +       if (fd_peak == -1) {
> > +               seq_printf(sf, "%llu\n", (u64)pc->watermark * PAGE_SIZE);
> > +               return 0;
> > +       }
> > +
> > +       s64 pc_peak = pc->local_watermark;
> > +       s64 wm = fd_peak > pc_peak ? fd_peak : pc_peak;
> > +
> > +       seq_printf(sf, "%lld\n", wm * PAGE_SIZE);
> > +       return 0;
> > +}
>
> As per Roman's feedback, don't mix decls and code.

Sorry, that one slipped through.
(I've now fixed my syntax highlighting to recognize s64 as a type)

> You can simplify it by extracting css and memcg in the callers, then
> pass the right struct page_counter *pc directly.
>
> That should eliminate most local variables as well.
>
> static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc)
> {
>         struct cgroup_of_peak *ofp = of_peak(sf->private);
>         u64 peak;
>
>         /* User wants global or local peak? */
>         if (ofp->value == -1)
>                 peak = pc->watermark;
>         else
>                 peak = max(ofp->value, pc->local_watermark);
>
>         seq_printf(sf, "%lld\n", peak * PAGE_SIZE);
> }
>
> > +static int memory_peak_show(struct seq_file *sf, void *v)
> > +{
> > +       return swap_memory_peak_show(sf, v, false);
>
> And then do:
>
>         struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf));
>
>         return peak_show(sf, v, &memcg->memory);
>
> Then do the same with ... &memcg->swap.

That is much cleaner! Thanks!

> > +inline ssize_t swap_memory_peak_write(
> > +       struct kernfs_open_file *of,
> > +       char *buf, size_t nbytes, loff_t off, bool swap_cg)
> > +{
>
> Same feedback as above. Please don't inline explicitly (unless it
> really is measurably a performance improvement in a critical path),
> and stick to surrounding coding style.
>
> Here too, pass page_counter directly and save the branches.
>
> > +       unsigned long cur;
> > +       struct memcg_peak_mem_ctx *peer_ctx;
> > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +       struct memcg_peak_mem_ctx *ctx = memcg_extract_peak_mem_ctx(of);
> > +       struct page_counter *pc;
> > +       struct list_head *watchers, *pos;
> > +
> > +       buf = strstrip(buf);
> > +       /* Only allow "reset" to keep the API clear */
> > +       if (strcmp(buf, "reset"))
> > +               return -EINVAL;
> > +
> > +       if (swap_cg) {
> > +               pc = &memcg->swap;
> > +               watchers = &memcg->peak_swap_local_watermark_watchers;
> > +       } else {
> > +               pc = &memcg->memory;
> > +               watchers = &memcg->peak_memory_local_watermark_watchers;
> > +       }
> > +
> > +       spin_lock(&memcg->swap_memory_peak_watchers_lock);
> > +
> > +       page_counter_reset_local_watermark(pc);
> > +       cur = pc->local_watermark;
> > +
> > +       list_for_each(pos, watchers) {
>
> list_for_each_entry()

Oh, I missed that one. Thanks!

> > +               peer_ctx = list_entry(pos, typeof(*ctx), peers);
> > +               if (cur > peer_ctx->local_watermark)
> > +                       peer_ctx->local_watermark = cur;
> > +       }
>
> I don't think this is quite right. local_peak could be higher than the
> current usage when a new watcher shows up. The other watchers should
> retain the higher local_peak, not the current usage.

Since we have to iterate over the list entries while holding the
spinlock, I think we avoid that kind of mismatch by keeping the
assignments inside that critical section as well. (and eliminate a few
other races)

As-is, we get a snapshot value of the current usage and use that for
the rest of the function while holding the lock, so those fd-local
watermarks will only change by assignments in this function.

On the other hand, pc->local_watermark may keep increasing, but that's fine.
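The behavior being debated here, that a reset through one FD must not shrink the peak another FD will report, can be sketched in userspace. This is a hypothetical model for illustration only: fixed arrays stand in for `list_head` membership, the names `peak_reset`/`peak_read` are invented, and the spinlock is not modeled.

```c
#include <assert.h>

#define MAX_WATCHERS 4

/* Each open FD has a saved value (-1 until its first reset). */
struct watcher {
	long value;
	int registered;
};

struct peak_state {
	unsigned long usage;           /* current usage */
	unsigned long local_watermark; /* peak since the last reset */
	struct watcher w[MAX_WATCHERS];
};

/* Reset through one FD: first push the outgoing local peak out to
 * every registered watcher (so none of them loses a peak it already
 * observed), then re-arm the counter at current usage. */
static void peak_reset(struct peak_state *s, int fd)
{
	int i;

	for (i = 0; i < MAX_WATCHERS; i++)
		if (s->w[i].registered &&
		    (long)s->local_watermark > s->w[i].value)
			s->w[i].value = s->local_watermark;

	s->local_watermark = s->usage;

	s->w[fd].registered = 1;
	s->w[fd].value = s->local_watermark;
}

/* A read through an FD reports the max of its saved value and the
 * counter's live local peak. */
static long peak_read(struct peak_state *s, int fd)
{
	if (!s->w[fd].registered)
		return -1; /* caller falls back to the global watermark */
	return s->w[fd].value > (long)s->local_watermark ?
	       s->w[fd].value : (long)s->local_watermark;
}
```

In this model, if fd 0 resets, usage later peaks at 7, and then fd 1 resets, fd 0 still reads 7: its saved value was topped up from the outgoing local peak before the counter was re-armed.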
>
> > +
> > +       if (ctx->local_watermark == -1)
> > +               /* only append to the list if we're not already there */
> > +               list_add_tail(&ctx->peers, watchers);
> > +
> > +       ctx->local_watermark = cur;
>
> This makes me think that page_counter_reset_local_watermark() is not a
> good helper. It obscures what's going on. Try without it.
>
> AFAICS the list ordering doesn't matter, so keep it simple and use a
> plain list_add().
>
>         /*
>          * A new local peak is being tracked in pc->local_watermark.
>          * Save current local peak in all watchers.
>          */
>         list_for_each_entry(pos, ...)
>                 if (pc->local_watermark > pos->value)
>                         pos->value = pc->local_watermark;
>
>         pc->local_watermark = page_counter_read(pc);
>
>         /* Initial write, register watcher */
>         if (ofp->value == -1)
>                 list_add()
>
>         ofp->value = pc->local_watermark;

> > diff --git a/mm/page_counter.c b/mm/page_counter.c
> > index db20d6452b71..724d31508664 100644
> > --- a/mm/page_counter.c
> > +++ b/mm/page_counter.c
> > @@ -79,9 +79,22 @@ void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
> >                 /*
> >                  * This is indeed racy, but we can live with some
> >                  * inaccuracy in the watermark.
> > +                *
> > +                * Notably, we have two watermarks to allow for both a globally
> > +                * visible peak and one that can be reset at a smaller scope.
> > +                *
> > +                * Since we reset both watermarks when the global reset occurs,
> > +                * we can guarantee that watermark >= local_watermark, so we
> > +                * don't need to do both comparisons every time.
> > +                *
> > +                * On systems with branch predictors, the inner condition should
> > +                * be almost free.
> >                  */
> > -               if (new > READ_ONCE(c->watermark))
> > -                       WRITE_ONCE(c->watermark, new);
> > +               if (new > READ_ONCE(c->local_watermark)) {
> > +                       WRITE_ONCE(c->local_watermark, new);
> > +                       if (new > READ_ONCE(c->watermark))
> > +                               WRITE_ONCE(c->watermark, new);
> > +               }
> >         }
> >  }
> >
> > @@ -131,10 +144,23 @@ bool page_counter_try_charge(struct page_counter *counter,
> >                         propagate_protected_usage(c, new);
> >                         /*
> >                          * Just like with failcnt, we can live with some
> > -                        * inaccuracy in the watermark.
> > +                        * inaccuracy in the watermarks.
> > +                        *
> > +                        * Notably, we have two watermarks to allow for both a globally
> > +                        * visible peak and one that can be reset at a smaller scope.
> > +                        *
> > +                        * Since we reset both watermarks when the global reset occurs,
> > +                        * we can guarantee that watermark >= local_watermark, so we
> > +                        * don't need to do both comparisons every time.
> > +                        *
> > +                        * On systems with branch predictors, the inner condition should
> > +                        * be almost free.
>
> /* See comment in page_counter_charge() */

Even better!

> > diff --git a/tools/testing/selftests/cgroup/cgroup_util.c b/tools/testing/selftests/cgroup/cgroup_util.c
> > index 432db923bced..1e2d46636a0c 100644
> > --- a/tools/testing/selftests/cgroup/cgroup_util.c
> > +++ b/tools/testing/selftests/cgroup/cgroup_util.c
> > @@ -141,6 +141,16 @@ long cg_read_long(const char *cgroup, const char *control)
> >         return atol(buf);
> >  }
>
> This should be in patch #2.

:facepalm: that's what I get for trying to split this and send it out
at the end of the day.

-- 
David Finkel
Senior Principal Software Engineer, Core Services