From: Jan Kara <jack@suse.cz>
To: Dennis Zhou <dennis@kernel.org>
Cc: Jan Kara <jack@suse.cz>, Mateusz Guzik <mjguzik@gmail.com>,
	linux-kernel@vger.kernel.org, tj@kernel.org, cl@linux.com,
	akpm@linux-foundation.org, shakeelb@google.com,
	linux-mm@kvack.org
Subject: Re: [PATCH 0/2] execve scalability issues, part 1
Date: Thu, 24 Aug 2023 11:19:56 +0200	[thread overview]
Message-ID: <20230824091956.drn6ucixj4qbxwa7@quack3> (raw)
In-Reply-To: <ZOZrzG/MgL8vw+lI@snowbird>

On Wed 23-08-23 13:27:56, Dennis Zhou wrote:
> On Wed, Aug 23, 2023 at 11:49:15AM +0200, Jan Kara wrote:
> > On Tue 22-08-23 16:24:56, Mateusz Guzik wrote:
> > > On 8/22/23, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 22-08-23 00:29:49, Mateusz Guzik wrote:
> > > >> On 8/21/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >> > True Fix(tm) is a longer story.
> > > >> >
> > > >> > Maybe let's sort out this patchset first, whichever way. :)
> > > >> >
> > > >>
> > > >> So I found the discussion around the original patch with a perf
> > > >> regression report.
> > > >>
> > > >> https://lore.kernel.org/linux-mm/20230608111408.s2minsenlcjow7q3@quack3/
> > > >>
> > > >> The reporter suggests dodging the problem by only allocating per-cpu
> > > >> counters when the process is going multithreaded. Given that there are
> > > >> still plenty of forever single-threaded procs out there, I think that
> > > >> does sound like a great plan regardless of what happens with this
> > > >> patchset.
> > > >>
> > > >> Almost all access is already done using dedicated routines, so this
> > > >> should be an afternoon churn to sort out, unless I missed a
> > > >> showstopper. (maybe there is no good place to stuff a flag/whatever
> > > >> other indicator about the state of counters?)
> > > >>
> > > >> That said I'll look into it some time this or next week.
> > > >
> > > > Good, just let me know how it went, I also wanted to start looking into
> > > > this to come up with some concrete patches :). What I had in mind was that
> > > > we could use 'counters == NULL' as an indication that the counter is still
> > > > in 'single counter mode'.
> > > >
> > > 
> > > In the current state there are only pointers to counters in mm_struct
> > > and there is no storage for them in task_struct. So I don't think
> > > merely null-checking the per-cpu stuff is going to cut it -- where
> > > should the single-threaded counters land?
> > 
> > I think you misunderstood. What I wanted to do is to provide a new flavor
> > of percpu_counter (sharing most of the code and definitions) which would
> > have an option to start as a simple counter (indicated by pcc->counters ==
> > NULL and using pcc->count for counting) and then be upgraded by a call to
> > the real percpu thing. I think such counters would also be useful on
> > occasions other than as rss counters.
> > 
> 
> Kent wrote something similar and sent it out last year [1]. However, the
> case differs slightly from what we'd want here: switching to percpu on
> 1 -> 2 threads vs. switching based on the update rate, which a single
> thread might be able to trigger?

Thanks for the pointer, but that version of counters is not really suitable
here as is (though we could factor out some common bits if that work is
happening). A single thread can easily do 10000 RSS updates per second.
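
To make that more concrete, below is a minimal, hypothetical sketch of such a
flavor. The lazy_percpu_counter_* names and the upgrade entry point are made
up for illustration and are not an existing API; the only point shown is using
fbc->counters == NULL as the "still in single counter mode" marker together
with a one-way upgrade:

/*
 * Hypothetical sketch only: a percpu_counter that starts in single counter
 * mode (fbc->counters == NULL, updates go to fbc->count under fbc->lock)
 * and can later be upgraded to a real per-cpu counter.
 */
static inline void lazy_percpu_counter_add(struct percpu_counter *fbc, s64 amount)
{
	unsigned long flags;

	if (!READ_ONCE(fbc->counters)) {
		raw_spin_lock_irqsave(&fbc->lock, flags);
		/* recheck under the lock in case we raced with an upgrade */
		if (!fbc->counters) {
			fbc->count += amount;
			raw_spin_unlock_irqrestore(&fbc->lock, flags);
			return;
		}
		raw_spin_unlock_irqrestore(&fbc->lock, flags);
	}
	percpu_counter_add(fbc, amount);
}

/* One-way upgrade, e.g. called when the second thread is created. */
static int lazy_percpu_counter_upgrade(struct percpu_counter *fbc, gfp_t gfp)
{
	s32 __percpu *counters = alloc_percpu_gfp(s32, gfp);
	unsigned long flags;

	if (!counters)
		return -ENOMEM;
	raw_spin_lock_irqsave(&fbc->lock, flags);
	/* fbc->count keeps the value accumulated so far */
	fbc->counters = counters;
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return 0;
}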

> [1] https://lore.kernel.org/lkml/20230501165450.15352-8-surenb@google.com/

								Honza

> > > Bonus problem: tasks other than current can modify these counters, and
> > > this needs to be safe against current playing with them at the same time
> > > (and it would be a shame to require current to use atomics on them).
> > 
> > Hum, I didn't realize that. Indeed I can see that e.g. khugepaged can be
> > modifying the counters for other processes. Thanks for pointing this out.
> > 
> > > That said, my initial proposal adds a union:
> > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > > index 5e74ce4a28cd..ea70f0c08286 100644
> > > --- a/include/linux/mm_types.h
> > > +++ b/include/linux/mm_types.h
> > > @@ -737,7 +737,11 @@ struct mm_struct {
> > > 
> > >                 unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> > > 
> > > -               struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +               union {
> > > +                       struct percpu_counter rss_stat[NR_MM_COUNTERS];
> > > +                       u64 *rss_stat_single;
> > > +               };
> > > +               bool    magic_flag_stuffed_elsewhere;
> > > 
> > >                 struct linux_binfmt *binfmt;
> > > 
> > > 
> > > Then for the single-threaded case an area is allocated for NR_MM_COUNTERS
> > > counters * 2 -- the first set is updated without any synchronization by
> > > the current thread, the second set is only to be modified by others and
> > > is protected with mm->arg_lock. The lock protects remote access to the
> > > union to begin with.
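
For illustration, a minimal sketch of how the update/read helpers could look
under that proposed scheme. The helper names are hypothetical; only rss_stat,
rss_stat_single, magic_flag_stuffed_elsewhere and arg_lock come from the diff
above, and rss_stat_single is assumed to hold 2 * NR_MM_COUNTERS entries
(local set first, remote set second):

/* Sketch only: hypothetical helpers for the proposed two-set scheme. */
static void mm_counter_add_local(struct mm_struct *mm, int member, long value)
{
	if (READ_ONCE(mm->magic_flag_stuffed_elsewhere)) {
		/* already upgraded to per-cpu counters */
		percpu_counter_add(&mm->rss_stat[member], value);
		return;
	}
	/* first set: only current touches it, no locking or atomics */
	mm->rss_stat_single[member] += value;
}

static void mm_counter_add_remote(struct mm_struct *mm, int member, long value)
{
	spin_lock(&mm->arg_lock);
	if (!mm->magic_flag_stuffed_elsewhere)
		/* second set: remote updaters serialize on arg_lock */
		mm->rss_stat_single[NR_MM_COUNTERS + member] += value;
	else
		percpu_counter_add(&mm->rss_stat[member], value);
	spin_unlock(&mm->arg_lock);
}

static u64 mm_counter_read_single(struct mm_struct *mm, int member)
{
	/* the true value is the sum of the local and the remote set */
	return mm->rss_stat_single[member] +
	       mm->rss_stat_single[NR_MM_COUNTERS + member];
}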
> > 
> > arg_lock seems a bit like a hack. How is it related to rss_stat? The scheme
> > with two counters is clever but I'm not 100% convinced the complexity is
> > really worth it. I'm not sure the overhead of always using an atomic
> > counter would really be measurable as atomic counter ops in local CPU cache
> > tend to be cheap. Did you try to measure the difference?
> > 
> > If the second counter proves to be worth it, we could make just that one
> > atomic to avoid the need for abusing some spinlock.
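
For comparison, a minimal sketch of that variant (hypothetical type and
helper names): the local set stays a plain counter that only current touches,
while the remote set is an atomic, so remote updaters need no spinlock at all:

/* Sketch only: local plain counters plus atomic remote counters. */
struct mm_rss_single {
	long		local[NR_MM_COUNTERS];	/* updated only by current */
	atomic_long_t	remote[NR_MM_COUNTERS];	/* updated by khugepaged etc. */
};

static void mm_rss_add_remote(struct mm_rss_single *rss, int member, long value)
{
	/* an atomic add on a cache-hot line is cheap, no lock needed */
	atomic_long_add(value, &rss->remote[member]);
}

static long mm_rss_read(struct mm_rss_single *rss, int member)
{
	return rss->local[member] + atomic_long_read(&rss->remote[member]);
}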
> > 
> > > Transition to per-CPU operation sets the magic flag (there is plenty
> > > of spare space in mm_struct, I'll find a good home for it without
> > > growing the struct). It would be a one-way street -- a process which
> > > gets a bunch of threads and then goes back to one thread stays with
> > > per-CPU counters.
> > 
> > Agreed with switching to be a one-way street.
> > 
> > > Then you get the true value of a given counter by adding up both sets.
> > > 
> > > arg_lock is sparingly used, so remote ops are not expected to contend
> > > with anything. In fact their cost is going to go down since percpu
> > > summation takes a spinlock which also disables interrupts.
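
For reference, the per-cpu summation path being compared against looks roughly
like this (simplified from lib/percpu_counter.c; details vary between kernel
versions), which is where the interrupt-disabling spinlock comes from:

s64 __percpu_counter_sum(struct percpu_counter *fbc)
{
	s64 ret;
	int cpu;
	unsigned long flags;

	raw_spin_lock_irqsave(&fbc->lock, flags);
	ret = fbc->count;
	for_each_online_cpu(cpu)
		ret += *per_cpu_ptr(fbc->counters, cpu);
	raw_spin_unlock_irqrestore(&fbc->lock, flags);
	return ret;
}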
> > > 
> > > Local ops should be about the same in cost as they are right now.
> > > 
> > > I might have missed some detail in the above description, but I think
> > > the approach is decent.
> > 
> > 								Honza
> > -- 
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


Thread overview: 31+ messages
2023-08-21 20:28 Mateusz Guzik
2023-08-21 20:28 ` [PATCH 1/2] pcpcntr: add group allocation/free Mateusz Guzik
2023-08-22 13:37   ` Vegard Nossum
2023-08-22 14:06     ` Mateusz Guzik
2023-08-22 17:02   ` Dennis Zhou
2023-08-21 20:28 ` [PATCH 2/2] fork: group allocation of per-cpu counters for mm struct Mateusz Guzik
2023-08-21 21:20   ` Matthew Wilcox
2023-08-21 20:42 ` [PATCH 0/2] execve scalability issues, part 1 Matthew Wilcox
2023-08-21 20:44   ` [PATCH 1/7] mm: Make folios_put() the basis of release_pages() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 2/7] mm: Convert free_unref_page_list() to use folios Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 3/7] mm: Add free_unref_folios() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 4/7] mm: Use folios_put() in __folio_batch_release() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 5/7] memcg: Add mem_cgroup_uncharge_batch() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 6/7] mm: Remove use of folio list from folios_put() Matthew Wilcox (Oracle)
2023-08-21 20:44     ` [PATCH 7/7] mm: Use free_unref_folios() in put_pages_list() Matthew Wilcox (Oracle)
2023-08-21 21:07 ` [PATCH 0/2] execve scalability issues, part 1 Dennis Zhou
2023-08-21 21:39   ` Mateusz Guzik
2023-08-21 22:29     ` Mateusz Guzik
2023-08-22  9:51       ` Jan Kara
2023-08-22 14:24         ` Mateusz Guzik
2023-08-23  9:49           ` Jan Kara
2023-08-23 10:49             ` David Laight
2023-08-23 12:01               ` Mateusz Guzik
2023-08-23 12:13             ` Mateusz Guzik
2023-08-23 15:47               ` Jan Kara
2023-08-23 16:10                 ` Mateusz Guzik
2023-08-23 16:41                   ` Jan Kara
2023-08-23 17:12                     ` Mateusz Guzik
2023-08-23 20:27             ` Dennis Zhou
2023-08-24  9:19               ` Jan Kara [this message]
2023-08-26 18:33 ` Mateusz Guzik
