From: Mateusz Guzik <mjguzik@gmail.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	 Tim Chen <tim.c.chen@intel.com>,
	Dave Chinner <dchinner@redhat.com>,
	 "Darrick J. Wong" <djwong@kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	 Carlos Maiolino <cem@kernel.org>,
	Chuck Lever <chuck.lever@oracle.com>, Jan Kara <jack@suse.cz>,
	 Matthew Wilcox <willy@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Axel Rasmussen <axelrasmussen@google.com>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	 linux-mm@kvack.org
Subject: Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
Date: Tue, 28 May 2024 15:44:44 +0200	[thread overview]
Message-ID: <5eemkb4lo5eefp7ijgncgogwmadyzmvjfjmmmvfiki6cwdskfs@hi2z4drqeuz6> (raw)
In-Reply-To: <jh3yqdz43c24ur7w2jjutyvwodsdccefo6ycmtmjyvh25hojn4@aysycyla6pom>

On Sat, May 25, 2024 at 08:00:15AM +0200, Mateusz Guzik wrote:
> On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > Percpu counter's compare and add are separate functions: without locking
> > around them (which would defeat their purpose), it has been possible to
> > overflow the intended limit.  Imagine all the other CPUs fallocating
> > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > 
> > I have not seen reports of that happening; but tmpfs's recent addition
> > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > 
> > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > 
> 
> I think the posted (and by now landed) code is racy.
> 
> I had seen there was a follow up patch which further augmented the
> routine, but it did not alter the issue below so I'm responding to this
> thread.
> 
> > +/*
> > + * Compare counter, and add amount if the total is within limit.
> > + * Return true if amount was added, false if it would exceed limit.
> > + */
> > +bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> > +				  s64 limit, s64 amount, s32 batch)
> > +{
> > +	s64 count;
> > +	s64 unknown;
> > +	unsigned long flags;
> > +	bool good;
> > +
> > +	if (amount > limit)
> > +		return false;
> > +
> > +	local_irq_save(flags);
> > +	unknown = batch * num_online_cpus();
> > +	count = __this_cpu_read(*fbc->counters);
> > +
> > +	/* Skip taking the lock when safe */
> > +	if (abs(count + amount) <= batch &&
> > +	    fbc->count + unknown <= limit) {
> > +		this_cpu_add(*fbc->counters, amount);
> > +		local_irq_restore(flags);
> > +		return true;
> > +	}
> > +
> 
> Note the value of fbc->count is *not* stabilized.
> 
> > +	raw_spin_lock(&fbc->lock);
> > +	count = fbc->count + amount;
> > +
> > +	/* Skip percpu_counter_sum() when safe */
> > +	if (count + unknown > limit) {
> > +		s32 *pcount;
> > +		int cpu;
> > +
> > +		for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> > +			pcount = per_cpu_ptr(fbc->counters, cpu);
> > +			count += *pcount;
> > +		}
> > +	}
> > +
> > +	good = count <= limit;
> > +	if (good) {
> > +		count = __this_cpu_read(*fbc->counters);
> > +		fbc->count += count + amount;
> > +		__this_cpu_sub(*fbc->counters, count);
> > +	}
> > +
> > +	raw_spin_unlock(&fbc->lock);
> > +	local_irq_restore(flags);
> > +	return good;
> > +}
> > +
> 
> Consider 2 cpus executing the function at the same time, where one is
> updating fbc->count in the slow path while the other is testing it in
> the fast path.
> 
> cpu0						cpu1
> 						if (abs(count + amount) <= batch &&				
> 						    fbc->count + unknown <= limit)
> /* gets past the per-cpu traversal */
> /* at this point cpu0 decided to bump fbc->count, while cpu1 decided to
>  * bump the local counter */
> 							this_cpu_add(*fbc->counters, amount);
> fbc->count += count + amount;
> 
> Suppose the fbc->count update puts it close enough to the limit that an
> addition from cpu1 would put the entire thing over said limit.
> 
> Since the 2 operations are not synchronized, cpu1 can observe fbc->count
> prior to the bump and update its local counter, resulting in the
> aforementioned overflow.
> 
> Am I misreading something here? Is this prevented?
> 
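
To make the window concrete, here is an illustrative set of numbers
(made up purely to show the arithmetic, not taken from any report):
2 online cpus, batch = 10, so unknown = 20, limit = 100; fbc->count = 70,
cpu0's per-cpu counter = 0, cpu1's = 5.  cpu0 calls with amount = 25,
misses the fast path, and under the lock sums 70 + 25 + 0 + 5 = 100 <= limit,
so it will proceed.  Before cpu0 writes fbc->count back, cpu1 calls with
amount = 5: it sees abs(5 + 5) <= batch and the stale 70 + 20 <= limit,
so it bumps its per-cpu counter.  cpu0 then sets fbc->count to
70 + 0 + 25 = 95 and the true total becomes 95 + 10 = 105, i.e. 5 over
the limit even though both callers were told they fit.
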
> To my grep the only user is the quota support in shmem, and I wonder if
> that use is even justified. I am aware of the performance realities of
> atomics. However, it very well may be that the codepaths which exercise
> the counter are suffering multicore issues elsewhere, making an atomic
> (in a dedicated cacheline) a perfectly sane choice for the foreseeable
> future. Should this be true, there would be no rush to work out a fixed
> variant of the routine.

So I tried it out, and I don't believe a per-cpu counter for shmem's
memory usage is warranted at this time. In a setting where usage keeps
changing there is a massive bottleneck around folio_lruvec_lock.

The rare case where a bunch of memory is populated one-off on a tmpfs
mount can probably tolerate the atomic; the count can only go so far up
before you are out of memory.

For the value to keep changing, some of the pages have to be released,
and that is what I'm testing below by slapping ftruncate on a
will-it-scale test doing 1MB file writes in a loop:

diff --git a/tests/write1.c b/tests/write1.c
index d860904..790ddb3 100644
--- a/tests/write1.c
+++ b/tests/write1.c
@@ -28,6 +28,7 @@ void testcase(unsigned long long *iterations, unsigned long nr)
                        size = 0;
                        lseek(fd, 0, SEEK_SET);
                }
+               ftruncate(fd, 0);

                (*iterations)++;
        }

With this the test continuously allocates pages for the 1MB file and
releases them.

When mounting tmpfs on /tmp and benching with "./write1_processes -t 20"
(20 workers), I see almost all of the time spent spinning in
__pv_queued_spin_lock_slowpath.

As such I don't believe the per-cpu split buys anything in terms of
scalability, and as I explained in the previous mail I think the routine
used here is buggy, with shmem being its only user. Should shmem switch
to a centralized atomic (either a cmpxchg loop or xadd + backpedal),
there should be no loss in scalability given the lruvec problem. Then
the percpu routine could be commented out or whacked for the time being.
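
To illustrate what I mean, a rough sketch of the two variants against a
plain atomic64_t (names made up, untested, and ignoring the
negative-amount handling the percpu routine later grew):

/*
 * Rough sketch only, kernel side (<linux/atomic.h>), names made up.
 * Variant 1: cmpxchg loop, never commits a value above the limit.
 */
static bool counter_limited_add_cmpxchg(atomic64_t *counter, s64 limit, s64 amount)
{
	s64 old = atomic64_read(counter);

	do {
		if (old + amount > limit)
			return false;
	} while (!atomic64_try_cmpxchg(counter, &old, old + amount));

	return true;
}

/*
 * Variant 2: xadd + backpedal, one atomic in the common case, undone
 * if the result went over the limit.
 */
static bool counter_limited_add_xadd(atomic64_t *counter, s64 limit, s64 amount)
{
	if (atomic64_add_return(amount, counter) <= limit)
		return true;

	atomic64_sub(amount, counter);
	return false;
}

The xadd variant can fail spuriously under contention (someone else's
in-flight add may transiently push the value over the limit before they
backpedal), while the cmpxchg loop retries instead; either way no
successful caller can leave the total over the limit.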

Backtraces for the interested:

bpftrace -e 'kprobe:__pv_queued_spin_lock_slowpath { @[kstack()] = count(); }'

@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+61
    folio_lruvec_lock_irqsave+95
    __page_cache_release+130
    folios_put_refs+139
    shmem_undo_range+702
    shmem_setattr+983
    notify_change+556
    do_truncate+131
    do_ftruncate+254
    __x64_sys_ftruncate+62
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
]: 4750931
@[
    __pv_queued_spin_lock_slowpath+5
    _raw_spin_lock_irqsave+61
    folio_lruvec_lock_irqsave+95
    folio_batch_move_lru+139
    lru_add_drain_cpu+141
    __folio_batch_release+49
    shmem_undo_range+702
    shmem_setattr+983
    notify_change+556
    do_truncate+131
    do_ftruncate+254
    __x64_sys_ftruncate+62
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+118
]: 4761483




Thread overview: 29+ messages
2023-09-30  3:23 [PATCH 0/8] shmem,tmpfs: general maintenance Hugh Dickins
2023-09-30  3:25 ` [PATCH 1/8] shmem: shrink shmem_inode_info: dir_offsets in a union Hugh Dickins
2023-09-30 16:16   ` Chuck Lever
2023-10-03 13:06   ` Jan Kara
2023-09-30  3:26 ` [PATCH 2/8] shmem: remove vma arg from shmem_get_folio_gfp() Hugh Dickins
2023-10-03 13:07   ` Jan Kara
2023-09-30  3:27 ` [PATCH 3/8] shmem: factor shmem_falloc_wait() out of shmem_fault() Hugh Dickins
2023-10-03 13:18   ` Jan Kara
2023-10-06  3:48     ` Hugh Dickins
2023-10-06 11:01       ` Jan Kara
2023-09-30  3:28 ` [PATCH 4/8] shmem: trivial tidyups, removing extra blank lines, etc Hugh Dickins
2023-10-03 13:20   ` Jan Kara
2023-09-30  3:30 ` [PATCH 5/8] shmem: shmem_acct_blocks() and shmem_inode_acct_blocks() Hugh Dickins
2023-10-03 13:21   ` Jan Kara
2023-09-30  3:31 ` [PATCH 6/8] shmem: move memcg charge out of shmem_add_to_page_cache() Hugh Dickins
2023-10-03 13:28   ` Jan Kara
2023-09-30  3:32 ` [PATCH 7/8] shmem: _add_to_page_cache() before shmem_inode_acct_blocks() Hugh Dickins
2023-09-30  3:42 ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Hugh Dickins
2023-10-04 15:32   ` Jan Kara
2023-10-04 23:10   ` Dave Chinner
2023-10-06  5:35     ` Hugh Dickins
2023-10-09  0:15       ` Dave Chinner
2023-10-12  4:36         ` Hugh Dickins
2023-10-12  4:40           ` [PATCH 9/8] percpu_counter: extend _limited_add() to negative amounts Hugh Dickins
2023-10-05 16:50   ` [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount) Chen, Tim C
2023-10-06  5:42     ` Hugh Dickins
2023-10-06 22:59       ` Dennis Zhou
2023-10-12  4:09         ` Hugh Dickins
     [not found]   ` <jh3yqdz43c24ur7w2jjutyvwodsdccefo6ycmtmjyvh25hojn4@aysycyla6pom>
2024-05-28 13:44     ` Mateusz Guzik [this message]
