* Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-12 21:11 UTC
To: hugh; +Cc: linux-mm

Hugh,

Christoph Hellwig recommended I email you about this issue.

In the Linux kernel, in both 2.4 and 2.6, the mm/shmem.c code has a
stat_lock in the shmem_sb_info structure which protects (among other
things) the free_blocks field and (under 2.6) the inode i_blocks field.
At SGI we've found that on larger systems (>32P) undergoing parallel
/dev/zero page faulting, as often happens during parallel application
startup, this locking does not scale very well due to the lock cacheline
bouncing between CPUs.

Back in 2.4 Jack Steiner hacked on this code to avoid taking the lock
when free_blocks was equal to ULONG_MAX, as it makes little sense to
perform bookkeeping operations when no practical limit has been
requested.  This (along with scaling fixes in other parts of the VM
system) provided for very good scaling of /dev/zero page faulting.
However, this could lead to problems in the shmem_set_size() function
during a remount operation; but as that operation is apparently fairly
rare on running systems, it solved the scaling problem in practice.

I've hacked up the 2.6 shmem.c code to not require the stat_lock to be
taken while accessing these two fields (free_blocks and i_blocks), but
unfortunately this does nothing more than change which cacheline is
bouncing around the system (the fields themselves, instead of the lock).
This of course was not unexpected.

Looking at this code, I don't see any straightforward way to alleviate
this problem.  So, I was wondering if you might have any ideas how one
might approach this.  I'm hoping for something that will give us good
scaling all the way up to 512P.

Thanks,
Brent Casavant <bcasavan@sgi.com>
Operating System Engineer, Silicon Graphics, Inc.
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-12 21:55 UTC
To: Brent Casavant; +Cc: hugh, linux-mm

On Mon, Jul 12, 2004 at 04:11:29PM -0500, Brent Casavant wrote:
> Looking at this code, I don't see any straightforward way to alleviate
> this problem.  So, I was wondering if you might have any ideas how one
> might approach this.  I'm hoping for something that will give us good
> scaling all the way up to 512P.

Smells like per-cpu split counter material to me.

-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-12 22:42 UTC
To: William Lee Irwin III; +Cc: hugh, linux-mm

On Mon, 12 Jul 2004, William Lee Irwin III wrote:

> On Mon, Jul 12, 2004 at 04:11:29PM -0500, Brent Casavant wrote:
> > Looking at this code, I don't see any straightforward way to alleviate
> > this problem.  So, I was wondering if you might have any ideas how one
> > might approach this.  I'm hoping for something that will give us good
> > scaling all the way up to 512P.
>
> Smells like per-cpu split counter material to me.

It does to me too, particularly since the only times the i_blocks and
free_blocks fields are used for something other than bookkeeping updates
are the inode getattr operation, the superblock statfs operation, and
the shmem_set_size() function, which is called infrequently.

The complication with this is that we'd either need to redefine i_blocks
in the inode structure (somehow I don't see that happening), or move
that field up into the shmem_inode_info structure and make the necessary
code adjustments.

I tried the latter approach (without splitting the counter) in one
attempt, but never quite got the code right.  I couldn't spot my bug,
but it was sort of a one-off hack-job anyway, and it got me the
information I needed at the time.

Thanks,
Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-13 19:56 UTC
To: William Lee Irwin III; +Cc: hugh, linux-mm

On Mon, 12 Jul 2004, Brent Casavant wrote:

> The complication with this is that we'd either need to redefine
> i_blocks in the inode structure (somehow I don't see that happening),
> or move that field up into the shmem_inode_info structure and make
> the necessary code adjustments.

Better idea, maybe.

Jack Steiner suggested to me that we really don't care about accounting
for i_blocks and free_blocks for /dev/zero mappings (question: Is he
right?).

If so, then it seems to me we could turn on a bit in the flags field of
the shmem_inode_info structure that says "don't bother with bookkeeping
for me".  We can then test for that flag wherever i_blocks and
free_blocks are updated, and omit the update if appropriate.  This
leaves tmpfs working appropriately for its "filesystem" role, and avoids
the cacheline bouncing problem for its "shared /dev/zero mappings" role.

Assuming this is correct, I imagine I should just snag the next bit in
the flags field (bit 0 is SHMEM_PAGEIN (== VM_READ) and bit 1 is
SHMEM_TRUNCATE (== VM_WRITE), so I'd use bit 2 for SHMEM_NOACCT
(== VM_EXEC)) and run with this idea, right?

Thoughts?

Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Hugh Dickins @ 2004-07-13 20:41 UTC
To: Brent Casavant; +Cc: William Lee Irwin III, linux-mm

On Tue, 13 Jul 2004, Brent Casavant wrote:
> On Mon, 12 Jul 2004, Brent Casavant wrote:
>
> > The complication with this is that we'd either need to redefine
> > i_blocks in the inode structure (somehow I don't see that happening),
> > or move that field up into the shmem_inode_info structure and make
> > the necessary code adjustments.
>
> Better idea, maybe.
>
> Jack Steiner suggested to me that we really don't care about accounting
> for i_blocks and free_blocks for /dev/zero mappings (question: Is he
> right?).

I think Jack's right: there's no visible mount point for df or du,
the files come ready-unlinked, nobody has an fd.

Though wli's per-cpu idea was sensible enough, converting to that didn't
appeal to me very much.  We only have a limited amount of per-cpu space,
I think, but an indefinite number of tmpfs mounts.  Might be reasonable
to allow per-cpu for 4 of them (the internal one which is troubling you,
/dev/shm, /tmp and one other).  Tiresome.

Jack's perception appeals to me much more (but, like you, I do wonder if
it'll really work out in practice).

> If so, then it seems to me we could turn on a bit in the flags field of
> the shmem_inode_info structure that says "don't bother with bookkeeping
> for me".  We can then test for that flag wherever i_blocks and
> free_blocks are updated, and omit the update if appropriate.  This
> leaves tmpfs working appropriately for its "filesystem" role, and avoids
> the cacheline bouncing problem for its "shared /dev/zero mappings" role.
>
> Assuming this is correct, I imagine I should just snag the next bit in
> the flags field (bit 0 is SHMEM_PAGEIN (== VM_READ) and bit 1 is
> SHMEM_TRUNCATE (== VM_WRITE), so I'd use bit 2 for SHMEM_NOACCT
> (== VM_EXEC)) and run with this idea, right?

Yes, go ahead, though it's getting more and more embarrassing that I
started out reusing VM_ACCOUNT within shmem.c, it should now have its
own set of flags: let me tidy that up once you're done.  (Something
else I should do for your scalability is stop putting everything on
the shmem_inodes list: that's only needed when pages are on swap.)

But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
to the security_vm_enough_memory/vm_unacct_memory stuff throughout, and
_that_ accounting does still apply to these /dev/zero files.

Hmm, I was about to suggest SHMEM_NOSBINFO, but how about really no
sbinfo, just NULL sbinfo?

Hugh
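[To illustrate the "NULL sbinfo" idea discussed here, the following is a
minimal userspace sketch of the guarded-accounting pattern: skip both the
lock and the bookkeeping when no limits apply.  The structure and field
names only mimic shmem's, and a pthread mutex stands in for the kernel
spinlock; this is not the actual mm/shmem.c code.]

/*
 * Sketch: accounting is done only when the superblock info exists.
 * A NULL sbinfo means "unlimited, nothing to account", so the hot
 * path never touches the shared lock or its cacheline.
 */
#include <pthread.h>
#include <stdio.h>

struct sb_info {                        /* stands in for shmem_sb_info */
	pthread_mutex_t stat_lock;
	unsigned long free_blocks;
};

struct inode_info {                     /* stands in for shmem_inode_info */
	struct sb_info *sbinfo;         /* NULL => no limits, no accounting */
	unsigned long i_blocks;
};

/* Charge one block to the filesystem, if anyone is counting. */
static int alloc_block(struct inode_info *info)
{
	struct sb_info *sbinfo = info->sbinfo;

	if (!sbinfo)                    /* internal mount: nothing to do */
		return 0;

	pthread_mutex_lock(&sbinfo->stat_lock);
	if (!sbinfo->free_blocks) {
		pthread_mutex_unlock(&sbinfo->stat_lock);
		return -1;              /* would be -ENOSPC in the kernel */
	}
	sbinfo->free_blocks--;
	info->i_blocks++;
	pthread_mutex_unlock(&sbinfo->stat_lock);
	return 0;
}

int main(void)
{
	struct inode_info devzero = { .sbinfo = NULL };   /* no limits */
	printf("unlimited charge: %d\n", alloc_block(&devzero));

	struct sb_info fs = { PTHREAD_MUTEX_INITIALIZER, 2 };
	struct inode_info file = { .sbinfo = &fs };
	for (int i = 0; i < 3; i++)      /* the third charge hits the limit */
		printf("limited charge %d: %d\n", i, alloc_block(&file));
	return 0;
}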
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-13 21:35 UTC
To: Hugh Dickins; +Cc: William Lee Irwin III, linux-mm

On Tue, 13 Jul 2004, Hugh Dickins wrote:

> Though wli's per-cpu idea was sensible enough, converting to that
> didn't appeal to me very much.  We only have a limited amount of
> per-cpu space, I think, but an indefinite number of tmpfs mounts.
> Might be reasonable to allow per-cpu for 4 of them (the internal
> one which is troubling you, /dev/shm, /tmp and one other).  Tiresome.

Per-CPU has the problem that the CPU on which you did a free_blocks++
might not be the same one where you do a free_blocks--.  Bleh.

Maybe using a hash indexed on some tid bits (pun unintended, but funny
nevertheless) might work?  But of course this suffers from the same
class of problem as mentioned in the previous paragraph.

> Yes, go ahead, though it's getting more and more embarrassing that I
> started out reusing VM_ACCOUNT within shmem.c, it should now have its
> own set of flags: let me tidy that up once you're done.

Hmm.  Guess that means I need to crack the whip on myself a bit... :)

> But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
> to the security_vm_enough_memory/vm_unacct_memory stuff throughout,
> and _that_ accounting does still apply to these /dev/zero files.
>
> Hmm, I was about to suggest SHMEM_NOSBINFO, but how about really no
> sbinfo, just NULL sbinfo?

If you'd like me to try that, I sure can.  The only problem is that I'm
having a devil of a time figuring out where the struct super_block comes
from for /dev/zero -- or heck, if it's even distinct from any others.
And the relationship between /dev/zero and /dev/shm is still quite fuzzy
as well.  Oh the joy of being new to a chunk of code...

Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-13 22:50 UTC
To: Brent Casavant; +Cc: Hugh Dickins, linux-mm

On Tue, 13 Jul 2004, Hugh Dickins wrote:
>> Though wli's per-cpu idea was sensible enough, converting to that
>> didn't appeal to me very much.  We only have a limited amount of
>> per-cpu space, I think, but an indefinite number of tmpfs mounts.
>> Might be reasonable to allow per-cpu for 4 of them (the internal
>> one which is troubling you, /dev/shm, /tmp and one other).  Tiresome.

On Tue, Jul 13, 2004 at 04:35:25PM -0500, Brent Casavant wrote:
> Per-CPU has the problem that the CPU on which you did a free_blocks++
> might not be the same one where you do a free_blocks--.  Bleh.
> Maybe using a hash indexed on some tid bits (pun unintended, but funny
> nevertheless) might work?  But of course this suffers from the same
> class of problem as mentioned in the previous paragraph.

This is a non-issue.  Full-fledged implementations of per-cpu counters
must be either insensitive to or explicitly handle underflow.  There are
several different ways to do this; I think there's one in a kernel
header that's an example of batched spills to and borrows from a global
counter (note that the batches are O(NR_CPUS); important for reducing
the arrival rate).  Another way would be to steal from other cpus,
analogous to how the scheduler steals tasks.

There's one in the scheduler I did, rq->nr_uninterruptible, that is
insensitive to underflow; the values are only examined in summation,
used for load average calculations.  It makes some sense, too, as
sleeping tasks aren't actually associated with runqueues, and so the
per-runqueue values wouldn't be meaningful.

I guess since that's not how it's being addressed anyway, it's academic.
It may make some kind of theoretical sense for e.g. databases on
similarly large cpu count systems, but in truth machines sensitive to
this issue are just not used for such and would have far worse and more
severe performance problems elsewhere, so again, why bother?

On Tue, 13 Jul 2004, Hugh Dickins wrote:
>> But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
>> to the security_vm_enough_memory/vm_unacct_memory stuff throughout,
>> and _that_ accounting does still apply to these /dev/zero files.
>> Hmm, I was about to suggest SHMEM_NOSBINFO, but how about really no
>> sbinfo, just NULL sbinfo?

On Tue, Jul 13, 2004 at 04:35:25PM -0500, Brent Casavant wrote:
> If you'd like me to try that, I sure can.  The only problem is that I'm
> having a devil of a time figuring out where the struct super_block comes
> from for /dev/zero -- or heck, if it's even distinct from any others.
> And the relationship between /dev/zero and /dev/shm is still quite fuzzy
> as well.  Oh the joy of being new to a chunk of code...

There is a global "anonymous mount" of tmpfs used to implement e.g.
MAP_SHARED mappings of /dev/zero, SysV shm, etc.  This mounted fs is not
associated with any point in the fs namespace.  So it's distinct from
all other mounted instances that are e.g. associated with mountpoints in
the fs namespace, and potentially even independent kern_mount()'d
instances, though I know of no others apart from the one used in
shmem.c, and they'd be awkward to arrange (static funcs & vars).  This
is just a convenience for setting up unlinked inodes etc. and can in
principle be done without, which would remove even more forms of global
state maintenance.

-- wli
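[The batched spill/borrow scheme described above can be illustrated with
a small userspace sketch, with per-thread slots standing in for per-cpu
data.  Each slot accumulates a local delta and only folds it into the
shared total, under the shared lock, once the delta exceeds a batch
threshold, so the hot path touches only its own cacheline.  This is only
the shape of the idea, not the kernel's percpu_counter implementation;
the slot count, batch size, and padding are arbitrary.]

/*
 * Sketch of a batched split counter.  Each slot is assumed to be
 * updated only by its owner (as per-cpu data effectively is, with
 * preemption disabled); the demo below is single-threaded.
 */
#include <pthread.h>
#include <stdio.h>

#define NR_SLOTS  8            /* stands in for NR_CPUS */
#define BATCH     32           /* spill/borrow granularity */

struct split_counter {
	pthread_mutex_t lock;
	long total;                                /* global, approximate */
	struct { long delta; char pad[56]; } slot[NR_SLOTS]; /* ~1 line each */
};

static void counter_mod(struct split_counter *c, int slot, long amount)
{
	long d = (c->slot[slot].delta += amount);

	if (d > BATCH || d < -BATCH) {             /* rare: fold into total */
		pthread_mutex_lock(&c->lock);
		c->total += d;
		pthread_mutex_unlock(&c->lock);
		c->slot[slot].delta = 0;
	}
}

static long counter_sum(struct split_counter *c)   /* slow, approximate read */
{
	long sum = c->total;
	for (int i = 0; i < NR_SLOTS; i++)
		sum += c->slot[i].delta;
	return sum;
}

int main(void)
{
	struct split_counter c = { .lock = PTHREAD_MUTEX_INITIALIZER };
	for (int i = 0; i < 1000; i++)
		counter_mod(&c, i % NR_SLOTS, +1);  /* e.g. free_blocks updates */
	printf("sum = %ld\n", counter_sum(&c));
	return 0;
}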
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-13 22:22 UTC
To: Hugh Dickins; +Cc: Brent Casavant, linux-mm

On Tue, Jul 13, 2004 at 09:41:34PM +0100, Hugh Dickins wrote:
> I think Jack's right: there's no visible mount point for df or du,
> the files come ready-unlinked, nobody has an fd.
> Though wli's per-cpu idea was sensible enough, converting to that
> didn't appeal to me very much.  We only have a limited amount of
> per-cpu space, I think, but an indefinite number of tmpfs mounts.
> Might be reasonable to allow per-cpu for 4 of them (the internal
> one which is troubling you, /dev/shm, /tmp and one other).  Tiresome.
> Jack's perception appeals to me much more
> (but, like you, I do wonder if it'll really work out in practice).

I ignored the specific usage case and looked only at the generic one.
Though I actually had in mind just shoving an array of cachelines in
the per-sb structure, it apparently is not even useful to maintain for
the case in question, so why bother?

-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-13 22:27 UTC
To: Hugh Dickins; +Cc: William Lee Irwin III, linux-mm

On Tue, 13 Jul 2004, Hugh Dickins wrote:
> On Tue, 13 Jul 2004, Brent Casavant wrote:
> > Assuming this is correct, I imagine I should just snag the next bit in
> > the flags field (bit 0 is SHMEM_PAGEIN (== VM_READ) and bit 1 is
> > SHMEM_TRUNCATE (== VM_WRITE), so I'd use bit 2 for SHMEM_NOACCT
> > (== VM_EXEC)) and run with this idea, right?
>
> Yes, go ahead, though it's getting more and more embarrassing that I
> started out reusing VM_ACCOUNT within shmem.c, it should now have its
> own set of flags: let me tidy that up once you're done.  (Something
> else I should do for your scalability is stop putting everything on
> the shmem_inodes list: that's only needed when pages are on swap.)
>
> But please don't call the new one SHMEM_NOACCT: ACCT or ACCOUNT refers
> to the security_vm_enough_memory/vm_unacct_memory stuff throughout, and
> _that_ accounting does still apply to these /dev/zero files.
>
> Hmm, I was about to suggest SHMEM_NOSBINFO, but how about really no
> sbinfo, just NULL sbinfo?

OK, I gave this a try (calling it SHMEM_NOSBINFO).  It seems to work
functionally.  I can't get time on our 512P until tomorrow morning
(CDT), so I'll hold off on the patch until I've seen that it really
fixes the problem.

I'd really like to volunteer to do the work to have a NULL sbinfo
entirely.  But that might take a lot more time to accomplish, as I'm
still puzzled by how all these pieces interact.

Thanks,
Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Andrew Morton @ 2004-07-28 9:26 UTC
To: Hugh Dickins; +Cc: bcasavan, wli, linux-mm

Hugh Dickins <hugh@veritas.com> wrote:
>
> Though wli's per-cpu idea was sensible enough, converting to that
> didn't appeal to me very much.  We only have a limited amount of
> per-cpu space, I think, but an indefinite number of tmpfs mounts.

What's wrong with <linux/percpu_counter.h>?
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-28 9:59 UTC
To: Andrew Morton; +Cc: Hugh Dickins, bcasavan, linux-mm

Hugh Dickins <hugh@veritas.com> wrote:
>> Though wli's per-cpu idea was sensible enough, converting to that
>> didn't appeal to me very much.  We only have a limited amount of
>> per-cpu space, I think, but an indefinite number of tmpfs mounts.

On Wed, Jul 28, 2004 at 02:26:25AM -0700, Andrew Morton wrote:
> What's wrong with <linux/percpu_counter.h>?

One issue with using it for the specific cases in question is that the
maintenance of the statistics is entirely unnecessary for them.

For the general case it may still make sense to do this.  SGI will have
to comment here, as the workloads I'm involved with are kernel intensive
enough in other areas and generally run on small enough systems to have
no visible issues in or around the areas described.

-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-28 22:21 UTC
To: William Lee Irwin III; +Cc: Andrew Morton, Hugh Dickins, linux-mm

On Wed, 28 Jul 2004, William Lee Irwin III wrote:

> Hugh Dickins <hugh@veritas.com> wrote:
> >> Though wli's per-cpu idea was sensible enough, converting to that
> >> didn't appeal to me very much.  We only have a limited amount of
> >> per-cpu space, I think, but an indefinite number of tmpfs mounts.
>
> On Wed, Jul 28, 2004 at 02:26:25AM -0700, Andrew Morton wrote:
> > What's wrong with <linux/percpu_counter.h>?
>
> One issue with using it for the specific cases in question is that the
> maintenance of the statistics is entirely unnecessary for them.

Yeah.  Hugh solved the stat_lock issue by getting rid of the superblock
info for the internal superblock(s?) corresponding to /dev/zero and
System V shared memory.  There was no way to get at that information
anyway, so it wasn't useful to pay to keep it around.

> For the general case it may still make sense to do this.  SGI will have
> to comment here, as the workloads I'm involved with are kernel intensive
> enough in other areas and generally run on small enough systems to have
> no visible issues in or around the areas described.

With Hugh's fix, the problem has now moved to other areas -- I consider
the stat_lock issue solved.  Now I'm running up against the
shmem_inode_info lock field.  A per-CPU structure isn't appropriate
here, because what it's mostly protecting is the inode swap entries, and
that isn't at all amenable to a per-CPU breakdown (i.e. this is real
data, not statistics).

The "obvious" fix is to morph the code so that the swap entries can be
updated in parallel to each other and in parallel to the other
miscellaneous fields in the shmem_inode_info structure.  But this would
be one *nasty* piece of work to accomplish, much less accomplish cleanly
and correctly.  I'm pretty sure my Linux skillset isn't up to the task,
though it hasn't kept me from trying.  On the upside, I don't think it
would significantly impact performance on low processor-count systems,
if we can manage to do it at all.

I'm kind of hoping for a fairy godmother to drop in, wave her magic
wand, and say "Here's the quick and easy and obviously correct
solution".  But what're the chances of that :).

Thanks,
Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Andrew Morton @ 2004-07-28 23:05 UTC
To: Brent Casavant; +Cc: wli, hugh, linux-mm

Brent Casavant <bcasavan@sgi.com> wrote:
>
> Now I'm running up against the shmem_inode_info lock field.

Normally a per-inode lock doesn't hurt too much because it's rare for
lots of tasks to whack on the same inode at the same time.

I guess with tmpfs-backed shm, we have a rare workload.  How unpleasant.

> I'm kind of hoping for a fairy godmother to drop in, wave her magic
> wand, and say "Here's the quick and easy and obviously correct
> solution".  But what're the chances of that :).

Oh, sending email to Hugh is one of my favourite problem-solving
techniques.  Grab a beer and sit back.
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-28 23:40 UTC
To: Andrew Morton; +Cc: linux-mm

On Wed, 28 Jul 2004, Andrew Morton wrote:

> Brent Casavant <bcasavan@sgi.com> wrote:
> Normally a per-inode lock doesn't hurt too much because it's rare for
> lots of tasks to whack on the same inode at the same time.
>
> I guess with tmpfs-backed shm, we have a rare workload.  How unpleasant.

Well, it's really not even a common workload for tmpfs-backed shm, where
common means "non-HPC".  Where SGI ran into this problem is with MPI
startup.  Our workaround at this time is to replace one large /dev/zero
mapping shared amongst many forked processes (e.g. one process per CPU)
with a bunch of single-page mappings of the same total size.  This
apparently has the effect of breaking the mapping up into multiple
inodes, and reduces contention for any particular inode lock.

But that's an ugly hack, and we really want to get rid of it.  I may be
talking out my rear, but I suspect that this will cause issues elsewhere
(e.g. lots of tiny VM regions to track, which can be painful at
fork/exec/exit time [if my IRIX experience serves me well]).  I can look
into the specifics of the workaround and probably provide numbers if
anyone is really interested in such things at this point.

Brent
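[A rough sketch of the shape of the workaround described above: one
large MAP_SHARED /dev/zero region (one tmpfs inode, one hot lock) versus
many page-sized MAP_SHARED mappings, each of which gets its own inode.
The names and sizes are invented for illustration; this is not SGI's
actual MPI startup code, and a real implementation would also have to
arrange for the pieces to be contiguous, which is part of what makes it
a hack.]

/*
 * Compare the two allocation strategies side by side.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages = 1024;                 /* pretend shared-arena size */
	int fd = open("/dev/zero", O_RDWR);
	if (fd < 0) { perror("open"); return 1; }

	/* Straightforward version: one mapping, hence one shmem inode. */
	void *arena = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
	if (arena == MAP_FAILED) { perror("mmap"); return 1; }

	/* Workaround version: one single-page mapping per page, so faults
	 * from different CPUs mostly hit different inodes and locks.
	 * Note the pieces land at unrelated addresses. */
	void **pieces = calloc(npages, sizeof(*pieces));
	for (size_t i = 0; i < npages; i++) {
		pieces[i] = mmap(NULL, page, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);
		if (pieces[i] == MAP_FAILED) { perror("mmap"); return 1; }
	}

	printf("arena at %p, %zu separate pages mapped\n", arena, npages);
	return 0;
}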
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-28 23:53 UTC
To: Brent Casavant; +Cc: Andrew Morton, linux-mm

On Wed, Jul 28, 2004 at 06:40:40PM -0500, Brent Casavant wrote:
> Well, it's really not even a common workload for tmpfs-backed shm, where
> common means "non-HPC".  Where SGI ran into this problem is with MPI
> startup.  Our workaround at this time is to replace one large /dev/zero
> mapping shared amongst many forked processes (e.g. one process per CPU)
> with a bunch of single-page mappings of the same total size.  This
> apparently has the effect of breaking the mapping up into multiple
> inodes, and reduces contention for any particular inode lock.
> But that's an ugly hack, and we really want to get rid of it.  I may be
> talking out my rear, but I suspect that this will cause issues elsewhere
> (e.g. lots of tiny VM regions to track, which can be painful at
> fork/exec/exit time [if my IRIX experience serves me well]).  I can look
> into the specifics of the workaround and probably provide numbers if
> anyone is really interested in such things at this point.

I'm very interested.  I have similar issues with per-inode locks in
other contexts.  This one is bound to factor in as well.

-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-28 23:53 UTC
To: Brent Casavant; +Cc: Andrew Morton, Hugh Dickins, linux-mm

On Wed, 28 Jul 2004, William Lee Irwin III wrote:
>> For the general case it may still make sense to do this.  SGI will have
>> to comment here, as the workloads I'm involved with are kernel intensive
>> enough in other areas and generally run on small enough systems to have
>> no visible issues in or around the areas described.

On Wed, Jul 28, 2004 at 05:21:58PM -0500, Brent Casavant wrote:
> With Hugh's fix, the problem has now moved to other areas -- I consider
> the stat_lock issue solved.  Now I'm running up against the
> shmem_inode_info lock field.  A per-CPU structure isn't appropriate
> here, because what it's mostly protecting is the inode swap entries,
> and that isn't at all amenable to a per-CPU breakdown (i.e. this is
> real data, not statistics).

This does look like it needs ad hoc methods for each of the various
fields.

On Wed, Jul 28, 2004 at 05:21:58PM -0500, Brent Casavant wrote:
> The "obvious" fix is to morph the code so that the swap entries can be
> updated in parallel to each other and in parallel to the other
> miscellaneous fields in the shmem_inode_info structure.  But this would
> be one *nasty* piece of work to accomplish, much less accomplish cleanly
> and correctly.  I'm pretty sure my Linux skillset isn't up to the task,
> though it hasn't kept me from trying.  On the upside, I don't think it
> would significantly impact performance on low processor-count systems,
> if we can manage to do it at all.
> I'm kind of hoping for a fairy godmother to drop in, wave her magic
> wand, and say "Here's the quick and easy and obviously correct
> solution".  But what're the chances of that :).

This may actually have some positive impact on highly kernel-intensive,
low processor-count database workloads (where kernel intensiveness makes
up for the reduced processor count vs. the usual numerical applications
at high processor counts on SGI systems).  At the moment a number of
stability issues have piled up that I need to take care of, but I would
be happy to work with you on devising methods of addressing this when
those clear up, which should be by the end of this week.

-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-29 14:54 UTC
To: William Lee Irwin III; +Cc: Andrew Morton, Hugh Dickins, linux-mm

On Wed, 28 Jul 2004, William Lee Irwin III wrote:

> On Wed, Jul 28, 2004 at 05:21:58PM -0500, Brent Casavant wrote:
> > The "obvious" fix is to morph the code so that the swap entries can be
> > updated in parallel to each other and in parallel to the other
> > miscellaneous fields in the shmem_inode_info structure.  But this would
> > be one *nasty* piece of work to accomplish, much less accomplish cleanly
> > and correctly.  I'm pretty sure my Linux skillset isn't up to the task,
> > though it hasn't kept me from trying.  On the upside, I don't think it
> > would significantly impact performance on low processor-count systems,
> > if we can manage to do it at all.
> > I'm kind of hoping for a fairy godmother to drop in, wave her magic
> > wand, and say "Here's the quick and easy and obviously correct
> > solution".  But what're the chances of that :).
>
> This may actually have some positive impact on highly kernel-intensive,
> low processor-count database workloads (where kernel intensiveness makes
> up for the reduced processor count vs. the usual numerical applications
> at high processor counts on SGI systems).

Good to know.  It always amazes me how close my knowledge horizon really
is.

> At the moment a number of stability issues have piled up that I need to
> take care of, but I would be happy to work with you on devising methods
> of addressing this when those clear up, which should be by the end of
> this week.

Count me in.  I've been chewing on this one for a while now, and I'll be
more than happy to help.

Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Hugh Dickins @ 2004-07-29 19:58 UTC
To: Brent Casavant; +Cc: William Lee Irwin III, Andrew Morton, linux-mm

On Wed, 28 Jul 2004, Brent Casavant wrote:
>
> With Hugh's fix, the problem has now moved to other areas -- I consider
> the stat_lock issue solved.

Me too, though I haven't passed those changes up the chain yet: waiting
to see what happens in this next round.

I didn't look into Andrew's percpu_counters in any depth: once I'd come
across PERCPU_ENOUGH_ROOM 32768 I concluded that percpu space is a
precious resource that we should resist depleting per mountpoint; but if
ext2/3 use it, I guess tmpfs could as well.  Revisit another time if
NULL sbinfo found wanting.

> Now I'm running up against the shmem_inode_info lock field.  A per-CPU
> structure isn't appropriate here, because what it's mostly protecting
> is the inode swap entries, and that isn't at all amenable to a per-CPU
> breakdown (i.e. this is real data, not statistics).

Jack Steiner's question was, why is this an issue on 2.6 when it wasn't
on 2.4?  Perhaps better parallelism elsewhere in 2.6 has shifted
contention to here?  Or was it an issue in 2.4 after all?

I keep wondering: why is contention on shmem_inode_info->lock a big deal
for you, but not contention on inode->i_mapping->tree_lock?

Once the shm segment or /dev/zero mapping pages are allocated,
info->lock shouldn't be used at all until you get to swapping - and I
hope it's safe to assume that someone with 512 cpus isn't optimizing for
swapping.

It's true that when shmem_getpage is allocating index and data pages, it
dips into and out of info->lock several times: I expect that does
exacerbate the bouncing.  Earlier in the day I was trying to rewrite it
a little to avoid that, for you to investigate if it makes any
difference; but abandoned that once I realized it would mean memclearing
pages inside the lock, something I'd much rather avoid.

> The "obvious" fix is to morph the code so that the swap entries can be
> updated in parallel to each other and in parallel to the other
> miscellaneous fields in the shmem_inode_info structure.

Why are all these threads allocating to the inode at the same time?

Are they all trying to lock down the same pages?  Or is each trying to
fault in a different page (as your "parallel" above suggests)?

Why doesn't the creator of the shm segment or /dev/zero mapping just
fault in all the pages before handing over to the other threads?

But I may well have entirely the wrong model of what's going on.  Could
you provide a small .c testcase to show what it's actually trying to do
when the problem manifests?  I don't have many cpus to reproduce it on,
but it should help to provoke a solution.  And/or profiles.

(Once we've shifted the contention from info->lock to
mapping->tree_lock, it'll be interesting but not conclusive to hear how
2.6.8 compares with 2.6.8-mm: since mm is currently using
read/write_lock_irq on tree_lock.)

Thanks,
Hugh
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-29 21:21 UTC
To: Hugh Dickins; +Cc: William Lee Irwin III, Andrew Morton, linux-mm

On Thu, 29 Jul 2004, Hugh Dickins wrote:

> Jack Steiner's question was, why is this an issue on 2.6 when it
> wasn't on 2.4?  Perhaps better parallelism elsewhere in 2.6 has
> shifted contention to here?  Or was it an issue in 2.4 after all?

It was, but for some reason it didn't show up with this particular test
code.

> Why are all these threads allocating to the inode at the same time?
>
> Are they all trying to lock down the same pages?  Or is each trying
> to fault in a different page (as your "parallel" above suggests)?

They're all trying to fault in a different page.

> Why doesn't the creator of the shm segment or /dev/zero mapping just
> fault in all the pages before handing over to the other threads?

Performance.  The mapping could well range into the tens or hundreds of
gigabytes, and faulting these pages in parallel would certainly be
advantageous.

> But I may well have entirely the wrong model of what's going on.
> Could you provide a small .c testcase to show what it's actually
> trying to do when the problem manifests?  I don't have many cpus
> to reproduce it on, but it should help to provoke a solution.

Sure.  I'll forward it separately.  I'll actually send you the very
program I've been using to test this work.  Jack Steiner wrote it, so
there shouldn't be any issue sharing it.

> (Once we've shifted the contention from info->lock to
> mapping->tree_lock, it'll be interesting but not conclusive to hear how
> 2.6.8 compares with 2.6.8-mm: since mm is currently using
> read/write_lock_irq on tree_lock.)

Therein lies the rub, right?  We solve one contention problem, only to
move it elsewhere. :)

Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-29 21:51 UTC
To: Hugh Dickins; +Cc: William Lee Irwin III, Andrew Morton, linux-mm

On Thu, 29 Jul 2004, Brent Casavant wrote:

> On Thu, 29 Jul 2004, Hugh Dickins wrote:
> > Why doesn't the creator of the shm segment or /dev/zero mapping just
> > fault in all the pages before handing over to the other threads?
>
> Performance.  The mapping could well range into the tens or hundreds
> of gigabytes, and faulting these pages in parallel would certainly
> be advantageous.

Oh, and let me clarify something.  I don't think anyone currently
performs mappings in the hundreds-of-gigabytes range, but it probably
won't be too many years until that happens.  The basic point is still
valid even at tens of gigabytes.

Brent
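[A minimal sketch of the access pattern described in the two messages
above -- one shared /dev/zero mapping, one forked child per CPU, each
child faulting only its own disjoint stripe of pages -- might look like
the following.  This is only an illustration of the workload shape, not
the actual test program Jack Steiner wrote; the stripe size is
arbitrary.]

/*
 * Parallel first-touch of a shared /dev/zero mapping: every write below
 * is a page fault into the same shmem inode, from a different process.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	size_t per_child = 4096;               /* pages per child, arbitrary */
	size_t total = (size_t)ncpus * per_child * page;

	int fd = open("/dev/zero", O_RDWR);
	char *mem = mmap(NULL, total, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (fd < 0 || mem == MAP_FAILED) { perror("setup"); return 1; }

	for (long c = 0; c < ncpus; c++) {
		if (fork() == 0) {
			char *base = mem + (size_t)c * per_child * page;
			for (size_t i = 0; i < per_child; i++)
				base[i * page] = 1;   /* first touch: fault */
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;                              /* reap all children */
	printf("faulted %zu MiB across %ld processes\n", total >> 20, ncpus);
	return 0;
}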
* Re: Scaling problem with shmem_sb_info->stat_lock

From: William Lee Irwin III @ 2004-07-30 1:00 UTC
To: Hugh Dickins; +Cc: Brent Casavant, Andrew Morton, linux-mm

On Wed, 28 Jul 2004, Brent Casavant wrote:
>> With Hugh's fix, the problem has now moved to other areas -- I consider
>> the stat_lock issue solved.

On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Me too, though I haven't passed those changes up the chain yet: waiting
> to see what happens in this next round.
> I didn't look into Andrew's percpu_counters in any depth: once I'd come
> across PERCPU_ENOUGH_ROOM 32768 I concluded that percpu space is a
> precious resource that we should resist depleting per mountpoint; but
> if ext2/3 use it, I guess tmpfs could as well.  Revisit another time if
> NULL sbinfo found wanting.

__alloc_percpu() doesn't seem to dip into this space; it rather seems to
use kmem_cache_alloc_node(), which shouldn't be subject to any
limitations beyond the nodes' memory capacities, and PERCPU_ENOUGH_ROOM
seems to be primarily for statically-allocated per_cpu data.  This may
very well be the headroom reserved for modules; I've not tracked the
per_cpu internals for a very long time, as what little I had to
contribute there was dropped.

On Wed, 28 Jul 2004, Brent Casavant wrote:
>> Now I'm running up against the shmem_inode_info lock field.  A per-CPU
>> structure isn't appropriate here, because what it's mostly protecting
>> is the inode swap entries, and that isn't at all amenable to a per-CPU
>> breakdown (i.e. this is real data, not statistics).

On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Jack Steiner's question was, why is this an issue on 2.6 when it wasn't
> on 2.4?  Perhaps better parallelism elsewhere in 2.6 has shifted
> contention to here?  Or was it an issue in 2.4 after all?
> I keep wondering: why is contention on shmem_inode_info->lock a big
> deal for you, but not contention on inode->i_mapping->tree_lock?

inode->i_mapping->tree_lock is where I've observed the majority of the
lock contention from operating on tmpfs files in parallel.  I still need
to write up the benchmark results for the rwlock in a coherent fashion.
One thing I need to do there to support all this is to discover if the
kernel-intensive workloads on smaller machines actually do find
shmem_inode_info->lock to be an issue after mapping->tree_lock is made
an rwlock, as they appear to suffer from mapping->tree_lock first,
unlike the SGI workloads, if these reports are accurate.

On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Once the shm segment or /dev/zero mapping pages are allocated,
> info->lock shouldn't be used at all until you get to swapping - and I
> hope it's safe to assume that someone with 512 cpus isn't optimizing
> for swapping.
> It's true that when shmem_getpage is allocating index and data pages,
> it dips into and out of info->lock several times: I expect that does
> exacerbate the bouncing.  Earlier in the day I was trying to rewrite
> it a little to avoid that, for you to investigate if it makes any
> difference; but abandoned that once I realized it would mean
> memclearing pages inside the lock, something I'd much rather avoid.

The workloads I'm running actually do encounter small amounts of swap IO
under higher loads.  I'm not terribly concerned with this, as the "fix"
that would be used in the field is adding more RAM, and it's just
generally not how those workloads are meant to be run, but rather only a
desperation measure.

On Wed, 28 Jul 2004, Brent Casavant wrote:
>> The "obvious" fix is to morph the code so that the swap entries can be
>> updated in parallel to each other and in parallel to the other
>> miscellaneous fields in the shmem_inode_info structure.

On Thu, Jul 29, 2004 at 08:58:54PM +0100, Hugh Dickins wrote:
> Why are all these threads allocating to the inode at the same time?
> Are they all trying to lock down the same pages?  Or is each trying
> to fault in a different page (as your "parallel" above suggests)?
> Why doesn't the creator of the shm segment or /dev/zero mapping just
> fault in all the pages before handing over to the other threads?
> But I may well have entirely the wrong model of what's going on.
> Could you provide a small .c testcase to show what it's actually
> trying to do when the problem manifests?  I don't have many cpus
> to reproduce it on, but it should help to provoke a solution.
> And/or profiles.
> (Once we've shifted the contention from info->lock to
> mapping->tree_lock, it'll be interesting but not conclusive to hear how
> 2.6.8 compares with 2.6.8-mm: since mm is currently using
> read/write_lock_irq on tree_lock.)

If it's a particularly large area, this may be for incremental
initialization so there aren't very long delays during program startup.

-- wli
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Brent Casavant @ 2004-07-30 21:40 UTC
To: Hugh Dickins; +Cc: William Lee Irwin III, Andrew Morton, linux-mm

On Thu, 29 Jul 2004, Hugh Dickins wrote:

> Why doesn't the creator of the shm segment or /dev/zero mapping just
> fault in all the pages before handing over to the other threads?

Dean Roe pointed out another answer to this.  For NUMA locality reasons
you want each individual physical page to be near the CPU which will use
it most heavily.  Having a single CPU fault in all the pages will
generally cause all pages to reside on a single NUMA node.

Brent
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Paul Jackson @ 2004-07-30 23:34 UTC
To: Brent Casavant; +Cc: hugh, wli, akpm, linux-mm

Brent wrote:
> Having a single CPU fault in all the pages will generally
> cause all pages to reside on a single NUMA node.

Couldn't one use Andi Kleen's NUMA mbind() to lay out the memory across
the desired nodes, before faulting it in?

Paul Jackson <pj@sgi.com>
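[A sketch of the mbind() suggestion: set an interleave policy on the
shared region before first touch, so pages spread across nodes even if a
single CPU faults them all in.  This assumes <numaif.h> from libnuma
(link with -lnuma) and a NUMA-capable kernel; the region size and node
mask below are arbitrary examples, and mbind() will fail on systems with
fewer nodes than the mask requests.]

/*
 * Interleave a shared /dev/zero region across several nodes before
 * touching it.
 */
#include <fcntl.h>
#include <numaif.h>            /* mbind(), MPOL_INTERLEAVE */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64UL << 20;                    /* 64 MB example region */
	int fd = open("/dev/zero", O_RDWR);
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (fd < 0 || mem == MAP_FAILED) { perror("setup"); return 1; }

	unsigned long nodemask = 0xf;               /* nodes 0-3, for example */
	if (mbind(mem, len, MPOL_INTERLEAVE, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");                    /* set policy before faulting */

	memset(mem, 0, len);    /* single-CPU touch; pages still interleave */
	printf("region at %p interleaved over nodemask 0x%lx\n",
	       mem, nodemask);
	return 0;
}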
* Re: Scaling problem with shmem_sb_info->stat_lock

From: Ray Bryant @ 2004-07-31 3:37 UTC
To: Paul Jackson; +Cc: Brent Casavant, hugh, wli, akpm, linux-mm

Perhaps, but then you still have one processor doing the zeroing and
setup for all of those pages, and that can be a significant serial
bottleneck.

Paul Jackson wrote:
> Brent wrote:
> > Having a single CPU fault in all the pages will generally
> > cause all pages to reside on a single NUMA node.
>
> Couldn't one use Andi Kleen's NUMA mbind() to lay out the memory
> across the desired nodes, before faulting it in?

Best Regards,
Ray Bryant <raybry@sgi.com>