Date: Sun, 10 Jun 2007 20:17:00 +0200
From: Andrea Arcangeli
Subject: Re: [PATCH 01 of 16] remove nr_scan_inactive/active
Message-ID: <20070610181700.GC7443@v2.random>
References: <8e38f7656968417dfee0.1181332979@v2.random> <466C36AE.3000101@redhat.com>
In-Reply-To: <466C36AE.3000101@redhat.com>
To: Rik van Riel
Cc: linux-mm@kvack.org

On Sun, Jun 10, 2007 at 01:36:46PM -0400, Rik van Riel wrote:
> Andrea Arcangeli wrote:
>
> >-	else
> >+	nr_inactive = zone_page_state(zone, NR_INACTIVE) >> priority;
> >+	if (nr_inactive < sc->swap_cluster_max)
> > 		nr_inactive = 0;
>
> This is a problem.
>
> On workloads with lots of anonymous memory, for example
> running a very large JVM or simply stressing the system
> with AIM7, the inactive list can be very small.
>
> If dozens (or even hundreds) of tasks get into the
> pageout code simultaneously, they will all spend a lot
> of time moving pages from the active to the inactive
> list, but they will not even try to free any of the
> (few) inactive pages the system has!
>
> We have observed systems in stress tests that spent
> well over 10 minutes in shrink_active_list before
> the first call to shrink_inactive_list was made.
>
> Your code looks like it could exacerbate that situation,
> by not having zone->nr_scan_inactive increment between
> calls.

If all tasks spend 10 minutes in shrink_active_list before the first
call to shrink_inactive_list, that could mean you hit exactly the race
I'm trying to fix with this very patch (i.e. nr_*active going totally
huge because of the race triggering, and then thrashing over the few
pages left in the *active_list until the artificially boosted
nr_*active finally goes down to zero in all the tasks that read it at
the unlucky time when it got huge).

So my patch may actually fix your situation completely, if your
trouble was nr_scan_active becoming huge for no good reason, just
because many tasks entered the VM at the same time on big-SMP systems.
Did you monitor the real sizes of the active lists during those 10
minutes and compare them to the nr_active value stored on the stack?

Normally, if the highest-priority pass only calls into
shrink_active_list, that's because the two lists need rebalancing. But
I fail to see how it could ever take 10 minutes for the first
shrink_inactive_list to trigger with my patch applied; if it happens
on current vanilla, that could be the race triggering, or anyway
something unrelated going wrong in the VM.

Overall this code seems quite flaky in its current "racy" form, so I
doubt it can be allowed to live as-is. In fact, even if we fix the
race with a slow shared lock in a fast path, or if we only make sure
not to exacerbate your situation with something as simple and
lock-less as "nr_active = min(sizeof_active_list, nr_scan_active)", I
think it would still be wrong to do more work in the current task if
we have other tasks helping us at the same time. We should do nothing
more, nothing less.

So I think that if we want those counters not to restart from zero at
each priority step (which is what I understand your worry to be), they
should be kept on the stack, task-local. That would still take into
account the previously not-scanned "nr_inactive" value. Not sure
what's best.
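To make that a bit more concrete, here is a rough sketch of what I
mean, only to illustrate the idea and not against any real tree: apart
from zone_page_state()/NR_*ACTIVE, sc->swap_cluster_max and the
struct zone/scan_control types, everything here (struct task_scan,
shrink_zone_sketch) is a made-up name.

/*
 * Sketch only: keep the scan counters on the stack of the task doing
 * reclaim, so they survive across the priority steps without being
 * shared (and raced on) zone-globally.
 */
struct task_scan {
	unsigned long nr_active;
	unsigned long nr_inactive;
};

static void shrink_zone_sketch(int priority, struct zone *zone,
			       struct scan_control *sc,
			       struct task_scan *ts)
{
	unsigned long active = zone_page_state(zone, NR_ACTIVE);
	unsigned long inactive = zone_page_state(zone, NR_INACTIVE);

	/* accumulate across priority steps, task-locally */
	ts->nr_active += active >> priority;
	ts->nr_inactive += inactive >> priority;

	/*
	 * Never let the counters grow past the real list sizes, so a
	 * task cannot keep hammering a tiny list only because the
	 * counter got boosted at an unlucky moment.
	 */
	ts->nr_active = min(ts->nr_active, active);
	ts->nr_inactive = min(ts->nr_inactive, inactive);

	if (ts->nr_active >= sc->swap_cluster_max) {
		/* ... call shrink_active_list() as the code does today ... */
		ts->nr_active = 0;
	}
	if (ts->nr_inactive >= sc->swap_cluster_max) {
		/* ... call shrink_inactive_list() as the code does today ... */
		ts->nr_inactive = 0;
	}
}

The caller would keep one such struct on its own stack for the whole
priority loop, so whatever was not scanned at one priority step is
carried over to the next one without ever being visible to (or
inflatable by) other tasks.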
I have the feeling that introducing task-local *nr_active/*nr_inactive
counters shared by all priority steps won't move the VM needle much,
but I sure wouldn't be against it. It will make the balancing fairer,
but in practice I don't expect huge differences: there are only 12
priority steps anyway, so the inactive list should be shrunk very
quickly even if the active list is huge. What I'm against in general
is the current per-zone, global and racy approach without limits,
which can potentially exacerbate your situation when nr_active becomes
huge despite the active list being very small.

Thanks.