* Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-12 21:11 UTC (permalink / raw)
To: hugh; +Cc: linux-mm
Hugh,
Christoph Hellwig recommended I email you about this issue.
In the Linux kernel, in both 2.4 and 2.6, in the mm/shmem.c code, there
is a stat_lock in the shmem_sb_info structure which protects (among other
things) the free_blocks field and (under 2.6) the inode i_blocks field.
At SGI we've found that on larger systems (>32P) undergoing parallel
/dev/zero page faulting, as often happens during parallel application
startup, this locking does not scale very well due to the lock cacheline
bouncing between CPUs.
Back in 2.4 Jack Steiner hacked on this code to avoid taking the lock
when free_blocks was equal to ULONG_MAX, as it makes little sense to
perform bookkeeping operations when no practical limit has been
requested. This (along with scaling fixes in other parts of the
VM system) provided for very good scaling of /dev/zero page faulting.
However, this could lead to problems in the shmem_set_size() function
during a remount operation; but as remounts are apparently fairly rare
on running systems, the hack solved the scaling problem in practice.
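The shape of that hack was roughly as follows (an illustrative sketch,
not Jack's actual 2.4 diff; shmem_alloc_block is a name I'm inventing
here for the allocation-side bookkeeping):

	static int shmem_alloc_block(struct inode *inode)
	{
		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

		if (sbinfo->free_blocks == ULONG_MAX)	/* "no limit" sentinel */
			return 0;			/* skip bookkeeping, skip the lock */

		spin_lock(&sbinfo->stat_lock);
		if (!sbinfo->free_blocks) {
			spin_unlock(&sbinfo->stat_lock);
			return -ENOSPC;
		}
		sbinfo->free_blocks--;
		spin_unlock(&sbinfo->stat_lock);
		return 0;
	}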
I've hacked up the 2.6 shmem.c code to not require the stat_lock to
be taken while accessing these two fields (free_blocks and i_blocks),
but unfortunately this does nothing more than change which cacheline
is bouncing around the system (the fields themselves, instead of
the lock). This of course was not unexpected.
Looking at this code, I don't see any straightforward way to alleviate
this problem. So, I was wondering if you might have any ideas how one
might approach this. I'm hoping for something that will give us good
scaling all the way up to 512P.
Thanks,
Brent Casavant
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-12 21:55 UTC (permalink / raw)
To: Brent Casavant; +Cc: hugh, linux-mm
On Mon, Jul 12, 2004 at 04:11:29PM -0500, Brent Casavant wrote:
> Looking at this code, I don't see any straightforward way to alleviate
> this problem. So, I was wondering if you might have any ideas how one
> might approach this. I'm hoping for something that will give us good
> scaling all the way up to 512P.
Smells like per-cpu split counter material to me.
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-12 22:42 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: hugh, linux-mm
On Mon, 12 Jul 2004, William Lee Irwin III wrote:
> On Mon, Jul 12, 2004 at 04:11:29PM -0500, Brent Casavant wrote:
> > Looking at this code, I don't see any straightforward way to alleviate
> > this problem. So, I was wondering if you might have any ideas how one
> > might approach this. I'm hoping for something that will give us good
> > scaling all the way up to 512P.
>
> Smells like per-cpu split counter material to me.
It does to me too, particularly since the only time the i_blocks and
free_blocks fields are used for something other than bookkeeping updates
is the inode getattr operation, the superblock statfs operation, and the
shmem_set_size() function which is called infrequently.
The complication with this is that we'd either need to redefine
i_blocks in the inode structure (somehow I don't see that happening),
or move that field up into the shmem_inode_info structure and make
the necessary code adjustments.
I tried the latter approach (without splitting the counter) in one
attempt, but never quite got the code right. I couldn't spot my bug,
but it was sort of a one-off hack job anyway, and it got me the information
I needed at the time.
Thanks,
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-13 19:56 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: hugh, linux-mm
On Mon, 12 Jul 2004, Brent Casavant wrote:
> The complication with this is that we'd either need to redefine
> i_blocks in the inode structure (somehow I don't see that happening),
> or move that field up into the shmem_inode_info structure and make
> the necessary code adjustments.
Better idea, maybe.
Jack Steiner suggested to me that we really don't care about accounting
for i_blocks and free_blocks for /dev/zero mappings (question: Is he
right?).
If so, then it seems to me we could turn on a bit in the flags field
of the shmem_inode_info structure that says "don't bother with bookkeeping
for me". We can then test for that flag wherever i_blocks and free_blocks
are updated, and omit the update if appropriate. This leaves tmpfs
working appropriately for its "filesystem" role, and avoids the
cacheline bouncing problem for its "shared /dev/zero mappings" role.
Assuming this is correct, I imagine I should just snag the next
bit in the flags field (bit 0 is SHMEM_PAGEIN (== VM_READ) and
bit 1 is SHMEM_TRUNCATE (== VM_WRITE), I'd use bit 2 for
SHMEM_NOACCT (== VM_EXEC)) and run with this idea, right?
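Concretely, something like this at each site that currently charges the
superblock (a sketch only; BLOCKS_PER_PAGE as used in shmem.c):

	#define SHMEM_NOACCT	VM_EXEC		/* proposed: skip sb bookkeeping */

	if (!(info->flags & SHMEM_NOACCT)) {
		spin_lock(&sbinfo->stat_lock);
		sbinfo->free_blocks--;
		inode->i_blocks += BLOCKS_PER_PAGE;
		spin_unlock(&sbinfo->stat_lock);
	}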
Thoughts?
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Hugh Dickins @ 2004-07-13 20:41 UTC (permalink / raw)
To: Brent Casavant; +Cc: William Lee Irwin III, linux-mm
On Tue, 13 Jul 2004, Brent Casavant wrote:
> On Mon, 12 Jul 2004, Brent Casavant wrote:
>
> > The complication with this is that we'd either need to redefine
> > i_blocks in the inode structure (somehow I don't see that happening),
> > or move that field up into the shmem_inode_info structure and make
> > the necessary code adjustments.
>
> Better idea, maybe.
>
> Jack Steiner suggested to me that we really don't care about accounting
> for i_blocks and free_blocks for /dev/zero mappings (question: Is he
> right?).
I think Jack's right: there's no visible mount point for df or du,
the files come ready-unlinked, nobody has an fd.
Though wli's per-cpu idea was sensible enough, converting to that
didn't appeal to me very much. We only have a limited amount of
per-cpu space, I think, but an indefinite number of tmpfs mounts.
Might be reasonable to allow per-cpu for 4 of them (the internal
one which is troubling you, /dev/shm, /tmp and one other). Tiresome.
Jack's perception appeals to me much more
(but, like you, I do wonder if it'll really work out in practice).
> If so, then it seems to me we could turn on a bit in the flags field
> of the shmem_inode_info structure that says "don't bother with bookkeeping
> for me". We can then test for that flag wherever i_blocks and free_blocks
> are updated, and omit the update if appropriate. This leaves tmpfs
> working appropriately for its "filesystem" role, and avoids the
> cacheline bouncing problem for its "shared /dev/zero mappings" role.
>
> Assuming this is correct, I imagine I should just snag the next
> bit in the flags field (bit 0 is SHMEM_PAGEIN (== VM_READ) and
> bit 1 is SHMEM_TRUNCATE (== VM_WRITE), I'd use bit 2 for
> SHMEM_NOACCT (== VM_EXEC)) and run with this idea, right?
Yes, go ahead, though it's getting more and more embarrassing that I
started out reusing VM_ACCOUNT within shmem.c, it should now have its
own set of flags: let me tidy that up once you're done. (Something
else I should do for your scalability is stop putting everything on
the shmem_inodes list: that's only needed when pages are on swap.)
But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
to the security_vm_enough_memory/vm_unacct_memory stuff throughout,
and _that_ accounting does still apply to these /dev/zero files.
Hmm, I was about to suggest SHMEM_NOSBINFO,
but how about really no sbinfo, just NULL sbinfo?
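I.e. at each accounting site, roughly (a sketch; the internal mount would
simply leave sb->s_fs_info NULL, and "freed" stands for however many
blocks are being released):

	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);

	if (sbinfo) {		/* NULL for the limitless internal mount */
		spin_lock(&sbinfo->stat_lock);
		sbinfo->free_blocks += freed;
		spin_unlock(&sbinfo->stat_lock);
	}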
Hugh
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-13 21:35 UTC (permalink / raw)
To: Hugh Dickins; +Cc: William Lee Irwin III, linux-mm
On Tue, 13 Jul 2004, Hugh Dickins wrote:
> Though wli's per-cpu idea was sensible enough, converting to that
> didn't appeal to me very much. We only have a limited amount of
> per-cpu space, I think, but an indefinite number of tmpfs mounts.
> Might be reasonable to allow per-cpu for 4 of them (the internal
> one which is troubling you, /dev/shm, /tmp and one other). Tiresome.
Per-CPU has the problem that the CPU on which you did a free_blocks++
might not be the same one where you do a free_blocks--. Bleh.
Maybe using a hash indexed on some tid bits (pun unintended, but funny
nevertheless) might work? But of course this suffers from the same
class of problem as mentioned in the previous paragraph.
> Yes, go ahead, though it's getting more and more embarrassing that I
> started out reusing VM_ACCOUNT within shmem.c, it should now have its
> own set of flags: let me tidy that up once you're done.
Hmm. Guess that means I need to crack the whip on myself a bit... :)
> But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
> to the security_vm_enough_memory/vm_unacct_memory stuff throughout,
> and _that_ accounting does still apply to these /dev/zero files.
>
> Hmm, I was about to suggest SHMEM_NOSBINFO,
> but how about really no sbinfo, just NULL sbinfo?
If you'd like me to try that, I sure can. The only problem is that
I'm having a devil of a time figuring out where the struct super_block
comes from for /dev/zero -- or heck, if it's even distinct from any
others. And the relationship between /dev/zero and /dev/shm is still
quite fuzzy as well. Oh the joy of being new to a chunk of code...
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-13 22:22 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Brent Casavant, linux-mm
On Tue, Jul 13, 2004 at 09:41:34PM +0100, Hugh Dickins wrote:
> I think Jack's right: there's no visible mount point for df or du,
> the files come ready-unlinked, nobody has an fd.
> Though wli's per-cpu idea was sensible enough, converting to that
> didn't appeal to me very much. We only have a limited amount of
> per-cpu space, I think, but an indefinite number of tmpfs mounts.
> Might be reasonable to allow per-cpu for 4 of them (the internal
> one which is troubling you, /dev/shm, /tmp and one other). Tiresome.
> Jack's perception appeals to me much more
> (but, like you, I do wonder if it'll really work out in practice).
I ignored the specific usage case and looked only at the generic one.
Though I actually had in mind just shoving an array of cachelines in
the per-sb structure, it apparently is not even useful to maintain for
the case in question, so why bother?
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-13 22:27 UTC (permalink / raw)
To: Hugh Dickins; +Cc: William Lee Irwin III, linux-mm
On Tue, 13 Jul 2004, Hugh Dickins wrote:
> On Tue, 13 Jul 2004, Brent Casavant wrote:
> > Assuming this is correct, I imagine I should just snag the next
> > bit in the flags field (bit 0 is SHMEM_PAGEIN (== VM_READ) and
> > bit 1 is SHMEM_TRUNCATE (== VM_WRITE), I'd use bit 2 for
> > SHMEM_NOACCT (== VM_EXEC)) and run with this idea, right?
>
> Yes, go ahead, though it's getting more and more embarrassing that I
> started out reusing VM_ACCOUNT within shmem.c, it should now have its
> own set of flags: let me tidy that up once you're done. (Something
> else I should do for your scalability is stop putting everything on
> the shmem_inodes list: that's only needed when pages are on swap.)
>
> But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
> to the security_vm_enough_memory/vm_unacct_memory stuff throughout,
> and _that_ accounting does still apply to these /dev/zero files.
>
> Hmm, I was about to suggest SHMEM_NOSBINFO,
> but how about really no sbinfo, just NULL sbinfo?
OK, I gave this a try (calling it SHMEM_NOSBINFO). It seems to work
functionally. I can't get time on our 512P until tomorrow morning (CDT),
so I'll hold off on the patch until I've seen that it really fixes the
problem.
I'd really like to volunteer to do the work to have a NULL sbinfo
entirely. But that might take a lot more time to accomplish as
I'm still puzzled by how all these pieces interact.
Thanks,
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-13 22:50 UTC (permalink / raw)
To: Brent Casavant; +Cc: Hugh Dickins, linux-mm
On Tue, 13 Jul 2004, Hugh Dickins wrote:
>> Though wli's per-cpu idea was sensible enough, converting to that
>> didn't appeal to me very much. We only have a limited amount of
>> per-cpu space, I think, but an indefinite number of tmpfs mounts.
>> Might be reasonable to allow per-cpu for 4 of them (the internal
>> one which is troubling you, /dev/shm, /tmp and one other). Tiresome.
On Tue, Jul 13, 2004 at 04:35:25PM -0500, Brent Casavant wrote:
> Per-CPU has the problem that the CPU on which you did a free_blocks++
> might not be the same one where you do a free_blocks--. Bleh.
> Maybe using a hash indexed on some tid bits (pun unintended, but funny
> nevertheless) might work? But of course this suffers from the same
> class of problem as mentioned in the previous paragraph.
This is a non-issue. Full-fledged implementations of per-cpu counters
must either be insensitive to underflow or handle it explicitly. There are
several different ways to do this; I think there's one in a kernel
header that's an example of batched spills to and borrows from a global
counter (note that the batches are O(NR_CPUS); important for reducing
the arrival rate). Another way would be to steal from other cpus
analogous to how the scheduler steals tasks. There's one in the
scheduler I did, rq->nr_uninterruptible, that is insensitive to
underflow; the values are only examined in summation, used for load
average calculations. It makes some sense, too, as sleeping tasks
aren't actually associated with runqueues, and so the per-runqueue
values wouldn't be meaningful.
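In outline, the batched-spill flavour looks something like this (a sketch
with made-up names; a real version pads each slot to its own cacheline
and pins the cpu with get_cpu()/put_cpu()):

	#define COUNTER_BATCH	(32 * NR_CPUS)	/* batch size is O(NR_CPUS) */

	struct split_counter {
		spinlock_t	lock;		/* init with spin_lock_init() */
		long		global;		/* protected by lock */
		long		local[NR_CPUS];	/* per-cpu deltas, may go negative */
	};

	static void split_counter_mod(struct split_counter *c, int cpu, long delta)
	{
		long val = c->local[cpu] + delta;

		if (val > COUNTER_BATCH || val < -COUNTER_BATCH) {
			spin_lock(&c->lock);	/* spill the whole batch at once */
			c->global += val;
			spin_unlock(&c->lock);
			val = 0;
		}
		c->local[cpu] = val;
	}

	static long split_counter_read(struct split_counter *c)
	{
		long sum = c->global;
		int i;

		for (i = 0; i < NR_CPUS; i++)	/* approximate: races with writers */
			sum += c->local[i];
		return sum;
	}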
I guess since that's not how it's being addressed anyway, it's academic.
It may make some kind of theoretical sense for e.g. databases on
similarly large cpu count systems, but in truth machines sensitive to
this issue are just not used for such and would have far worse and more
severe performance problems elsewhere, so again, why bother?
On Tue, 13 Jul 2004, Hugh Dickins wrote:
>> But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
>> to the security_vm_enough_memory/vm_unacct_memory stuff throughout,
>> and _that_ accounting does still apply to these /dev/zero files.
>> Hmm, I was about to suggest SHMEM_NOSBINFO,
>> but how about really no sbinfo, just NULL sbinfo?
On Tue, Jul 13, 2004 at 04:35:25PM -0500, Brent Casavant wrote:
> If you'd like me to try that, I sure can. The only problem is that
> I'm having a devil of a time figuring out where the struct super_block
> comes from for /dev/zero -- or heck, if it's even distinct from any
> others. And the relationship between /dev/zero and /dev/shm is still
> quite fuzzy as well. Oh the joy of being new to a chunk of code...
There is a global "anonymous mount" of tmpfs used to implement e.g.
MAP_SHARED mappings of /dev/zero, SysV shm, etc. This mounted fs is
not associated with any point in the fs namespace. So it's distinct
from all other mounted instances that are e.g. associated with
mountpoints in the fs namespace, and potentially even independent
kern_mount()'d instances, though I know of no others apart from the one
used in shmem.c, and they'd be awkward to arrange (static funcs & vars).
This is just a convenience for setting up unlinked inodes etc. and can
in principle be done without, which would remove even more forms of
global state maintenance.
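From memory, the relevant pieces in mm/shmem.c look roughly like this
(names as I recall them):

	static struct vfsmount *shm_mnt;	/* the anonymous internal mount */

	static int __init init_tmpfs(void)
	{
		/* ... register_filesystem(&tmpfs_fs_type) etc. ... */
		shm_mnt = kern_mount(&tmpfs_fs_type);	/* no mountpoint in the namespace */
		return 0;
	}

	/* SysV shm and MAP_SHARED /dev/zero mappings then get ready-unlinked
	 * files on that mount via shmem_file_setup()/shmem_zero_setup(). */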
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Andrew Morton @ 2004-07-28 9:26 UTC (permalink / raw)
To: Hugh Dickins; +Cc: bcasavan, wli, linux-mm
Hugh Dickins <hugh@veritas.com> wrote:
>
> Though wli's per-cpu idea was sensible enough, converting to that
> didn't appeal to me very much. We only have a limited amount of
> per-cpu space, I think, but an indefinite number of tmpfs mounts.
What's wrong with <linux/percpu_counter.h>?
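That is, something like this (a sketch; interface names from memory of
the current tree):

	#include <linux/percpu_counter.h>

	struct percpu_counter free_blocks;

	percpu_counter_init(&free_blocks);
	percpu_counter_mod(&free_blocks, -1);		/* allocate a block */
	percpu_counter_mod(&free_blocks, +1);		/* free it again */
	printk("~%ld blocks free\n", percpu_counter_read(&free_blocks));	/* approximate */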
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-28 9:59 UTC (permalink / raw)
To: Andrew Morton; +Cc: Hugh Dickins, bcasavan, linux-mm
Hugh Dickins <hugh@veritas.com> wrote:
>> Though wli's per-cpu idea was sensible enough, converting to that
>> didn't appeal to me very much. We only have a limited amount of
>> per-cpu space, I think, but an indefinite number of tmpfs mounts.
On Wed, Jul 28, 2004 at 02:26:25AM -0700, Andrew Morton wrote:
> What's wrong with <linux/percpu_counter.h>?
One issue with using it for the specific cases in question is that the
maintenance of the statistics is entirely unnecessary for them.
For the general case it may still make sense to do this. SGI will have
to comment here, as the workloads I'm involved with are kernel intensive
enough in other areas and generally run on small enough systems to have
no visible issues in or around the areas described.
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-28 22:21 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Andrew Morton, Hugh Dickins, linux-mm
On Wed, 28 Jul 2004, William Lee Irwin III wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
> >> Though wli's per-cpu idea was sensible enough, converting to that
> >> didn't appeal to me very much. We only have a limited amount of
> >> per-cpu space, I think, but an indefinite number of tmpfs mounts.
>
> On Wed, Jul 28, 2004 at 02:26:25AM -0700, Andrew Morton wrote:
> > What's wrong with <linux/percpu_counter.h>?
>
> One issue with using it for the specific cases in question is that the
> maintenance of the statistics is entirely unnecessary for them.
Yeah. Hugh solved the stat_lock issue by getting rid of the superblock
info for the internal superblock(s?) corresponding to /dev/zero and
System V shared memory. There was no way to get at that information
anyway, so it wasn't worth the cost of keeping it around.
> For the general case it may still make sense to do this. SGI will have
> to comment here, as the workloads I'm involved with are kernel intensive
> enough in other areas and generally run on small enough systems to have
> no visible issues in or around the areas described.
With Hugh's fix, the problem has now moved to other areas -- I consider
the stat_lock issue solved. Now I'm running up against the shmem_inode_info
lock field. A per-CPU structure isn't appropriate here because what it's
mostly protecting is the inode swap entries, and that isn't at all amenable
to a per-CPU breakdown (i.e. this is real data, not statistics).
The "obvious" fix is to morph the code so that the swap entries can be
updated in parallel to each other and in parallel to the other miscellaneous
fields in the shmem_inode_info structure. But this would be one *nasty*
piece of work to accomplish, much less accomplish cleanly and correctly.
I'm pretty sure my Linux skillset isn't up to the task, though it hasn't
kept me from trying. On the upside I don't think it would significantly
impact performance on low processor-count systems, if we can manage to
do it at all.
I'm kind of hoping for a fairy godmother to drop in, wave her magic wand,
and say "Here's the quick and easy and obviously correct solution". But
what're the chances of that :).
Thanks,
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Andrew Morton @ 2004-07-28 23:05 UTC (permalink / raw)
To: Brent Casavant; +Cc: wli, hugh, linux-mm
Brent Casavant <bcasavan@sgi.com> wrote:
>
> Now I'm running up against the shmem_inode_info
> lock field.
Normally a per-inode lock doesn't hurt too much because it's rare
for lots of tasks to whack on the same inode at the same time.
I guess with tmpfs-backed-shm, we have a rare workload. How
unpleasant.
> I'm kind of hoping for a fairy godmother to drop in, wave her magic wand,
> and say "Here's the quick and easy and obviously correct solution". But
> what're the chances of that :).
Oh, sending email to Hugh is one of my favourite problem-solving techniques.
Grab a beer and sit back.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-28 23:40 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm
On Wed, 28 Jul 2004, Andrew Morton wrote:
> Brent Casavant <bcasavan@sgi.com> wrote:
> Normally a per-inode lock doesn't hurt too much because it's rare
> for lots of tasks to whack on the same inode at the same time.
>
> I guess with tmpfs-backed-shm, we have a rare workload. How
> unpleasant.
Well, it's really not even a common workload for tmpfs-backed-shm, where
common means "non-HPC". Where SGI ran into this problem is with MPI
startup. Our workaround at this time is to replace one large /dev/zero
mapping shared amongst many forked processes (e.g. one process per CPU)
with a bunch of single-page mappings of the same total size. This
apparently has the effect of breaking the mapping up into multiple inodes,
and reduces contention for any particular inode lock.
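In outline, the workaround amounts to the following (illustrative only,
not our actual MPI code; zero_fd comes from open("/dev/zero", O_RDWR)
before the fork()s):

	#include <sys/mman.h>

	/* before: one big shared mapping, hence one shmem inode (and one lock) */
	static void *map_one(int zero_fd, size_t total)
	{
		return mmap(NULL, total, PROT_READ | PROT_WRITE,
			    MAP_SHARED, zero_fd, 0);
	}

	/* workaround: many small shared mappings; each mmap() of /dev/zero is
	 * backed by its own shmem object, so parallel faults hit different locks */
	static void map_many(int zero_fd, size_t total, size_t chunk, void **out)
	{
		size_t off, i = 0;

		for (off = 0; off < total; off += chunk)
			out[i++] = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
					MAP_SHARED, zero_fd, 0);
	}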
But that's an ugly hack, and we really want to get rid of it. I may be
talking out my rear, but I suspect that this will cause issues elsewhere
(e.g. lots of tiny VM regions to track, which can be painful at
fork/exec/exit time [if my IRIX experience serves me well]). I can look
into the specifics of the workaround and probably provide numbers if
anyone is really interested in such things at this point.
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-28 23:53 UTC (permalink / raw)
To: Brent Casavant; +Cc: Andrew Morton, linux-mm
On Wed, Jul 28, 2004 at 06:40:40PM -0500, Brent Casavant wrote:
> Well, it's really not even a common workload for tmpfs-backed-shm, where
> common means "non-HPC". Where SGI ran into this problem is with MPI
> startup. Our workaround at this time is to replace one large /dev/zero
> mapping shared amongst many forked processes (e.g. one process per CPU)
> with a bunch of single-page mappings of the same total size. This
> apparently has the effect of breaking the mapping up into multiple inodes,
> and reduces contention for any particular inode lock.
> But that's an ugly hack, and we really want to get rid of it. I may be
> talking out my rear, but I suspect that this will cause issues elsewhere
> (e.g. lots of tiny VM regions to track, which can be painful at
> fork/exec/exit time [if my IRIX experience serves me well]). I can look
> into the specifics of the workaround and probably provide numbers if
> anyone is really interested in such things at this point.
I'm very interested. I have similar issues with per-inode locks in other
contexts. This one is bound to factor in as well.
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-28 23:53 UTC (permalink / raw)
To: Brent Casavant; +Cc: Andrew Morton, Hugh Dickins, linux-mm
On Wed, 28 Jul 2004, William Lee Irwin III wrote:
>> For the general case it may still make sense to do this. SGI will have
>> to comment here, as the workloads I'm involved with are kernel intensive
>> enough in other areas and generally run on small enough systems to have
>> no visible issues in or around the areas described.
On Wed, Jul 28, 2004 at 05:21:58PM -0500, Brent Casavant wrote:
> With Hugh's fix, the problem has now moved to other areas -- I consider
> the stat_lock issue solved. Now I'm running up against the shmem_inode_info
> lock field. A per-CPU structure isn't appropriate here because what it's
> mostly protecting is the inode swap entries, and that isn't at all amenable
> to a per-CPU breakdown (i.e. this is real data, not statistics).
This does look like it needs ad hoc methods for each of the various
fields.
On Wed, Jul 28, 2004 at 05:21:58PM -0500, Brent Casavant wrote:
> The "obvious" fix is to morph the code so that the swap entries can be
> updated in parallel to each other and in parallel to the other miscellaneous
> fields in the shmem_inode_info structure. But this would be one *nasty*
> piece of work to accomplish, much less accomplish cleanly and correctly.
> I'm pretty sure my Linux skillset isn't up to the task, though it hasn't
> kept me from trying. On the upside I don't think it would significantly
> impact performance on low processor-count systems, if we can manage to
> do it at all.
> I'm kind of hoping for a fairy godmother to drop in, wave her magic wand,
> and say "Here's the quick and easy and obviously correct solution". But
> what're the chances of that :).
This may actually have some positive impact on highly kernel-intensive
low processor count database workloads (where kernel intensiveness makes
up for the reduced processor count vs. the usual numerical applications
at high processor counts on SGI systems). At the moment a number of
stability issues have piled up that I need to take care of, but I would
be happy to work with you on devising methods of addressing this when
those clear up, which should be by the end of this week.
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-29 14:54 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: Andrew Morton, Hugh Dickins, linux-mm
On Wed, 28 Jul 2004, William Lee Irwin III wrote:
> On Wed, Jul 28, 2004 at 05:21:58PM -0500, Brent Casavant wrote:
> > The "obvious" fix is to morph the code so that the swap entries can be
> > updated in parallel to each other and in parallel to the other miscellaneous
> > fields in the shmem_inode_info structure. But this would be one *nasty*
> > piece of work to accomplish, much less accomplish cleanly and correctly.
> > I'm pretty sure my Linux skillset isn't up to the task, though it hasn't
> > kept me from trying. On the upside I don't think it would significantly
> > impact performance on low processor-count systems, if we can manage to
> > do it at all.
> > I'm kind of hoping for a fairy godmother to drop in, wave her magic wand,
> > and say "Here's the quick and easy and obviously correct solution". But
> > what're the chances of that :).
>
> This may actually have some positive impact on highly kernel-intensive
> low processor count database workloads (where kernel intensiveness makes
> up for the reduced processor count vs. the usual numerical applications
> at high processor counts on SGI systems).
Good to know. It always amazes me how close my knowledge horizon really is.
> At the moment a number of
> stability issues have piled up that I need to take care of, but I would
> be happy to work with you on devising methods of addressing this when
> those clear up, which should be by the end of this week.
Count me in. I've been chewing on this one for a while now, and I'll
be more than happy to help.
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Hugh Dickins @ 2004-07-29 19:58 UTC (permalink / raw)
To: Brent Casavant; +Cc: William Lee Irwin III, Andrew Morton, linux-mm
On Wed, 28 Jul 2004, Brent Casavant wrote:
>
> With Hugh's fix, the problem has now moved to other areas -- I consider
> the stat_lock issue solved.
Me too, though I haven't passed those changes up the chain yet:
waiting to see what happens in this next round.
I didn't look into Andrew's percpu_counters in any depth:
once I'd come across PERCPU_ENOUGH_ROOM 32768 I concluded that
percpu space is a precious resource that we should resist depleting
per mountpoint; but if ext2/3 use it, I guess tmpfs could as well.
Revisit another time if NULL sbinfo found wanting.
> Now I'm running up against the shmem_inode_info
> lock field. A per-CPU structure isn't appropriate here because what it's
> mostly protecting is the inode swap entries, and that isn't at all amenable
> to a per-CPU breakdown (i.e. this is real data, not statistics).
Jack Steiner's question was, why is this an issue on 2.6 when it
wasn't on 2.4? Perhaps better parallelism elsewhere in 2.6 has
shifted contention to here? Or was it an issue in 2.4 after all?
I keep wondering: why is contention on shmem_inode_info->lock a big
deal for you, but not contention on inode->i_mapping->tree_lock?
Once the shm segment or /dev/zero mapping pages are allocated, info->lock
shouldn't be used at all until you get to swapping - and I hope it's safe
to assume that someone with 512 cpus isn't optimizing for swapping.
It's true that when shmem_getpage is allocating index and data pages,
it dips into and out of info->lock several times: I expect that does
exacerbate the bouncing. Earlier in the day I was trying to rewrite
it a little to avoid that, for you to investigate if it makes any
difference; but abandoned that once I realized it would mean
memclearing pages inside the lock, something I'd much rather avoid.
> The "obvious" fix is to morph the code so that the swap entries can be
> updated in parallel to each other and in parallel to the other miscellaneous
> fields in the shmem_inode_info structure.
Why are all these threads allocating to the inode at the same time?
Are they all trying to lock down the same pages? Or is each trying
to fault in a different page (as your "parallel" above suggests)?
Why doesn't the creator of the shm segment or /dev/zero mapping just
fault in all the pages before handing over to the other threads?
But I may well have entirely the wrong model of what's going on.
Could you provide a small .c testcase to show what it's actually
trying to do when the problem manifests? I don't have many cpus
to reproduce it on, but it should help to provoke a solution.
And/or profiles.
(Once we've shifted the contention from info->lock to mapping->tree_lock,
it'll be interesting but not conclusive to hear how 2.6.8 compares with
2.6.8-mm: since mm is currently using read/write_lock_irq on tree_lock.)
Thanks,
Hugh
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-29 21:21 UTC (permalink / raw)
To: Hugh Dickins; +Cc: William Lee Irwin III, Andrew Morton, linux-mm
On Thu, 29 Jul 2004, Hugh Dickins wrote:
> Jack Steiner's question was, why is this an issue on 2.6 when it
> wasn't on 2.4? Perhaps better parallelism elsewhere in 2.6 has
> shifted contention to here? Or was it an issue in 2.4 after all?
It was, but for some reason it didn't show up with this particular
test code.
> Why are all these threads allocating to the inode at the same time?
>
> Are they all trying to lock down the same pages? Or is each trying
> to fault in a different page (as your "parallel" above suggests)?
They're all trying to fault in a different page.
> Why doesn't the creator of the shm segment or /dev/zero mapping just
> fault in all the pages before handing over to the other threads?
Performance. The mapping could well range into the tens or hundreds
of gigabytes, and faulting these pages in parallel would certainly
be advantageous.
> But I may well have entirely the wrong model of what's going on.
> Could you provide a small .c testcase to show what it's actually
> trying to do when the problem manifests? I don't have many cpus
> to reproduce it on, but it should help to provoke a solution.
Sure. I'll forward it separately. I'll actually send you the
very program I've been using to test this work. Jack Steiner
wrote it, so there shouldn't be any issue sharing it.
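In essence it does no more than this (a sketch of the access pattern
only, not Jack's actual code, and with the sizes scaled way down):

	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <stdlib.h>

	#define NCHILDREN	64	/* one child per cpu in the real test */
	#define PAGES_PER_CHILD	4096	/* far smaller than the real runs */

	int main(void)
	{
		long psz = sysconf(_SC_PAGESIZE);
		size_t len = (size_t)NCHILDREN * PAGES_PER_CHILD * psz;
		int fd = open("/dev/zero", O_RDWR);
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		int c;

		for (c = 0; c < NCHILDREN; c++) {
			if (fork() == 0) {
				/* each child faults in its own slice of the mapping */
				char *q = p + (size_t)c * PAGES_PER_CHILD * psz;
				size_t i;

				for (i = 0; i < PAGES_PER_CHILD; i++)
					q[i * psz] = 1;
				_exit(0);
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}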
> (Once we've shifted the contention from info->lock to mapping->tree_lock,
> it'll be interesting but not conclusive to hear how 2.6.8 compares with
> 2.6.8-mm: since mm is currently using read/write_lock_irq on tree_lock.)
Therein lies the rub, right? We solve one contention problem, only to
move it elsewhere. :)
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-29 21:51 UTC (permalink / raw)
To: Hugh Dickins; +Cc: William Lee Irwin III, Andrew Morton, linux-mm
On Thu, 29 Jul 2004, Brent Casavant wrote:
> On Thu, 29 Jul 2004, Hugh Dickins wrote:
> > Why doesn't the creator of the shm segment or /dev/zero mapping just
> > fault in all the pages before handing over to the other threads?
>
> Performance. The mapping could well range into the tens or hundreds
> of gigabytes, and faulting these pages in parallel would certainly
> be advantageous.
Oh, and let me clarify something. I don't think anyone currently
performs mappings in the hundreds of gigabytes range. But it probably
won't be too many years until that one happens.
But the basic point even at tens of gigabytes is still valid.
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: William Lee Irwin III @ 2004-07-30 1:00 UTC (permalink / raw)
To: Hugh Dickins; +Cc: Brent Casavant, Andrew Morton, linux-mm
On Wed, 28 Jul 2004, Brent Casavant wrote:
>> With Hugh's fix, the problem has now moved to other areas -- I consider
>> the stat_lock issue solved.
On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Me too, though I haven't passed those changes up the chain yet:
> waiting to see what happens in this next round.
> I didn't look into Andrew's percpu_counters in any depth:
> once I'd come across PERCPU_ENOUGH_ROOM 32768 I concluded that
> percpu space is a precious resource that we should resist depleting
> per mountpoint; but if ext2/3 use it, I guess tmpfs could as well.
> Revisit another time if NULL sbinfo found wanting.
__alloc_percpu() doesn't seem to dip into this space; it rather seems
to use kmem_cache_alloc_node(), which shouldn't be subject to any
limitations beyond the nodes' memory capacities, and PERCPU_ENOUGH_ROOM
seems to be primarily for statically-allocated per_cpu data. This may
very well be the headroom reserved for modules; I've not tracked the
per_cpu internals for a very long time, as what little I had to
contribute there was dropped.
On Wed, 28 Jul 2004, Brent Casavant wrote:
>> Now I'm running up against the shmem_inode_info
>> lock field. A per-CPU structure isn't appropriate here because what it's
>> mostly protecting is the inode swap entries, and that isn't at all amenable
>> to a per-CPU breakdown (i.e. this is real data, not statistics).
On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Jack Steiner's question was, why is this an issue on 2.6 when it
> wasn't on 2.4? Perhaps better parallelism elsewhere in 2.6 has
> shifted contention to here? Or was it an issue in 2.4 after all?
> I keep wondering: why is contention on shmem_inode_info->lock a big
> deal for you, but not contention on inode->i_mapping->tree_lock?
inode->i_mapping->tree_lock is where I've observed the majority of the
lock contention from operating on tmpfs files in parallel. I still need
to write up the benchmark results for the rwlock in a coherent fashion.
One thing I need to do there to support all this is to discover if the
kernel-intensive workloads on smaller machines actually do find
shmem_inode_info->lock to be an issue after mapping->tree_lock is made
an rwlock, as they appear to suffer from mapping->tree_lock first,
unlike the SGI workloads if these reports are accurate.
On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Once the shm segment or /dev/zero mapping pages are allocated, info->lock
> shouldn't be used at all until you get to swapping - and I hope it's safe
> to assume that someone with 512 cpus isn't optimizing for swapping.
> It's true that when shmem_getpage is allocating index and data pages,
> it dips into and out of info->lock several times: I expect that does
> exacerbate the bouncing. Earlier in the day I was trying to rewrite
> it a little to avoid that, for you to investigate if it makes any
> difference; but abandoned that once I realized it would mean
> memclearing pages inside the lock, something I'd much rather avoid.
The workloads I'm running actually do encounter small amounts of
swap IO under higher loads. I'm not terribly concerned with this as
the "fix" that would be used in the field is adding more RAM, and
it's just generally not how those workloads are meant to be run, but
rather only a desperation measure.
On Wed, 28 Jul 2004, Brent Casavant wrote:
>> The "obvious" fix is to morph the code so that the swap entries can be
>> updated in parallel to each other and in parallel to the other miscellaneous
>> fields in the shmem_inode_info structure.
On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Why are all these threads allocating to the inode at the same time?
> Are they all trying to lock down the same pages? Or is each trying
> to fault in a different page (as your "parallel" above suggests)?
> Why doesn't the creator of the shm segment or /dev/zero mapping just
> fault in all the pages before handing over to the other threads?
> But I may well have entirely the wrong model of what's going on.
> Could you provide a small .c testcase to show what it's actually
> trying to do when the problem manifests? I don't have many cpus
> to reproduce it on, but it should help to provoke a solution.
> And/or profiles.
> (Once we've shifted the contention from info->lock to mapping->tree_lock,
> it'll be interesting but not conclusive to hear how 2.6.8 compares with
> 2.6.8-mm: since mm is currently using read/write_lock_irq on tree_lock.)
If it's a particularly large area, this may be for incremental
initialization so there aren't very long delays during program startup.
-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Brent Casavant @ 2004-07-30 21:40 UTC (permalink / raw)
To: Hugh Dickins; +Cc: William Lee Irwin III, Andrew Morton, linux-mm
On Thu, 29 Jul 2004, Hugh Dickins wrote:
> Why doesn't the creator of the shm segment or /dev/zero mapping just
> fault in all the pages before handing over to the other threads?
Dean Roe pointed out another answer to this. For NUMA locality reasons
you want individual physical pages to be near the CPU which will use it
most heavily. Having a single CPU fault in all the pages will generally
cause all pages to reside on a single NUMA node.
Brent
--
Brent Casavant bcasavan@sgi.com Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Paul Jackson @ 2004-07-30 23:34 UTC (permalink / raw)
To: Brent Casavant; +Cc: hugh, wli, akpm, linux-mm
Brent wrote:
> Having a single CPU fault in all the pages will generally
> cause all pages to reside on a single NUMA node.
Couldn't one use Andi Kleen's numa mbind() to layout the
memory across the desired nodes, before faulting it in?
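Something like this before first touch, say (a sketch; mbind() from
Andi's NUMA API, with the nodemask handling simplified):

	#include <numaif.h>	/* Andi Kleen's NUMA API library */

	/* interleave a not-yet-faulted region across nodes 0..nnodes-1;
	 * real code would build the mask from the actual topology */
	static long interleave_region(void *addr, unsigned long len, int nnodes)
	{
		unsigned long nodemask =
			(nnodes >= 64) ? ~0UL : (1UL << nnodes) - 1;

		return mbind(addr, len, MPOL_INTERLEAVE,
			     &nodemask, 8 * sizeof(nodemask), 0);
	}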
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373
* Re: Scaling problem with shmem_sb_info->stat_lock
From: Ray Bryant @ 2004-07-31 3:37 UTC (permalink / raw)
To: Paul Jackson; +Cc: Brent Casavant, hugh, wli, akpm, linux-mm
Perhaps, but then you still have one processor doing the zeroing and setup for
all of those pages, and that can be a significant serial bottleneck.
Paul Jackson wrote:
> Brent wrote:
>
>>Having a single CPU fault in all the pages will generally
>>cause all pages to reside on a single NUMA node.
>
>
> Couldn't one use Andi Kleen's numa mbind() to layout the
> memory across the desired nodes, before faulting it in?
>
--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
raybry@sgi.com raybry@austin.rr.com
The box said: "Requires Windows 98 or better",
so I installed Linux.
-----------------------------------------------
Thread overview: 24 messages
2004-07-12 21:11 Scaling problem with shmem_sb_info->stat_lock Brent Casavant
2004-07-12 21:55 ` William Lee Irwin III
2004-07-12 22:42 ` Brent Casavant
2004-07-13 19:56 ` Brent Casavant
2004-07-13 20:41 ` Hugh Dickins
2004-07-13 21:35 ` Brent Casavant
2004-07-13 22:50 ` William Lee Irwin III
2004-07-13 22:22 ` William Lee Irwin III
2004-07-13 22:27 ` Brent Casavant
2004-07-28 9:26 ` Andrew Morton
2004-07-28 9:59 ` William Lee Irwin III
2004-07-28 22:21 ` Brent Casavant
2004-07-28 23:05 ` Andrew Morton
2004-07-28 23:40 ` Brent Casavant
2004-07-28 23:53 ` William Lee Irwin III
2004-07-28 23:53 ` William Lee Irwin III
2004-07-29 14:54 ` Brent Casavant
2004-07-29 19:58 ` Hugh Dickins
2004-07-29 21:21 ` Brent Casavant
2004-07-29 21:51 ` Brent Casavant
2004-07-30 1:00 ` William Lee Irwin III
2004-07-30 21:40 ` Brent Casavant
2004-07-30 23:34 ` Paul Jackson
2004-07-31 3:37 ` Ray Bryant