From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <3D2540CE.89A1688E@zip.com.au>
Date: Thu, 04 Jul 2002 23:46:38 -0700
From: Andrew Morton
MIME-Version: 1.0
Subject: Re: vm lock contention reduction
References: <3D2530B9.8BC0C0AE@zip.com.au>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Return-Path:
To: Linus Torvalds
Cc: Rik van Riel, Andrea Arcangeli, "linux-mm@kvack.org"
List-ID:

Linus Torvalds wrote:
>
> On Thu, 4 Jul 2002, Andrew Morton wrote:
> > >
> > > Get away from this "minimum wait" thing, because it is WRONG.
> >
> > Well yes, we do want to batch work up. And a crude way of doing that
> > is "each time 64 pages have come clean, wake up one waiter".
>
> That would probably work, but we also need to be careful that we don't get
> into some strange situations (ie we have 50 waiters that all needed memory
> at the same time, and less than 50*64 pages that caused us to be in
> trouble, so we only wake up the 46 first waiters and the last 4 waiters
> get stuck until the next batch even though we now have _lots_ of pages
> free).
>
> Don't laugh - things like this have actually happened at times with some
> of our balancing work with HIGHMEM/NORMAL. Basically, the logic would go:
> "everybody who ends up having to wait for an allocation should free at
> least N pages", but then you would end up with 50*N pages total that the
> system thought it "needed" to free up, and that could be a big number that
> would cause the VM to want to free up stuff long after it was really done.

Well, we'd certainly need to make direct caller-reclaim the normal mode
of operation, and avoid context switches in the page allocator.

However it occurs to me that we could easily get into the situation
where a page allocator finds a PageDirty or PageWriteback page at the
tail of the LRU and waits on it, even though there are plenty of
reclaimable pages further along the LRU. In this situation it would be
better to just tag the page as "wait on it next time around" and then
skip it.

This is basically what the PageLaunder/BH_Launder logic was doing, I
think. For a long time it wasn't very effective, because it wasn't able
to wait on pages/buffers which were written by someone else. Andrea
finally sorted that out by setting BH_Launder in submit_bh(). All those
deadlock problems have gone away now, and we can (re)implement this
much more simply.

But what I don't like about it is that it's dumb. The kernel ends up
doing these enormous list scans without achieving very much, whereas we
could achieve the same effect by doing a bit of page motion at
interrupt time. It's a polled-versus-interrupt thing. And right now
that dumbness is multiplied by the CPU count, because it all happens
under pagemap_lru_lock. But with the bustup of that, at least we can be
scalably dumb ;)

> > Or
> > "as soon as the number of reclaimable pages exceeds zone->pages_min".
> > Some logic would also be needed to prevent new page allocators from
> > jumping the queue, of course.
>
> Yeah, the unfairness is the thing that really can be nasty.
>
> On the other hand, some unfairness is ok too - and can improve throughput.
> So jumping the queue is fine, you just must not be able to _consistently_
> jump the queue.
>
> (In fact, jumping the queue is inevitable to some degree - not allowing
> any queue jumping at all would imply that any time _anybody_ starts
> waiting, every single allocation afterwards will have to wait until the
> waiter got woken up. Which we have actually tried before at times, but
> which causes really really bad behaviour and horribly bad "pausing")
>
> You probably want the occasional allocator able to jump the queue, but
> the "big spenders" to be caught eventually. "Fairness" really doesn't
> mean that "everybody should wait equally much", it really means "people
> should wait roughly relative to how much they 'spend' memory".

Right. And that implies heuristics to divine which tasks are heavy page
allocators. uh-oh.

But as a first-order approximation: if a task is currently allocating
pages from within generic_file_write(), then whack it hard.
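Something along these lines, maybe. Totally untested sketch: PF_MEMHOG
and file_write_guts() are invented names and the *4 is pulled out of
the air - this is just the shape of the heuristic, not a real patch:

/* mm/filemap.c: tag the task while it's inside generic_file_write() */
ssize_t generic_file_write(struct file *file, const char *buf,
			   size_t count, loff_t *ppos)
{
	ssize_t ret;

	current->flags |= PF_MEMHOG;	/* made-up task flag */
	ret = file_write_guts(file, buf, count, ppos); /* invented */
	current->flags &= ~PF_MEMHOG;
	return ret;
}

/* mm/vmscan.c: how much reclaim work this caller has to do */
static int reclaim_quantum(void)
{
	int nr_pages = SWAP_CLUSTER_MAX;

	if (current->flags & PF_MEMHOG)
		nr_pages *= 4;		/* whack it hard */
	return nr_pages;
}

That way the casual allocator pays SWAP_CLUSTER_MAX and the big
spenders pay four times that, which is at least roughly "wait relative
to how much you spend".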
Here's PageLaunder-for-2.5. (Not tested enough - don't apply ;)) It
seems to help vmstat somewhat, but it still gets stuck in
shrink_cache->get_request_wait() a lot.

 include/linux/page-flags.h |    7 +++++++
 mm/filemap.c               |    1 +
 mm/vmscan.c                |    2 +-
 3 files changed, 9 insertions(+), 1 deletion(-)

--- 2.5.24/mm/vmscan.c~second-chance-throttle	Thu Jul  4 23:17:14 2002
+++ 2.5.24-akpm/mm/vmscan.c	Thu Jul  4 23:18:40 2002
@@ -443,7 +443,7 @@ shrink_cache(int nr_pages, zone_t *class
 		 * IO in progress? Leave it at the back of the list.
 		 */
 		if (unlikely(PageWriteback(page))) {
-			if (may_enter_fs) {
+			if (may_enter_fs && TestSetPageThrottle(page)) {
 				page_cache_get(page);
 				spin_unlock(&pagemap_lru_lock);
 				wait_on_page_writeback(page);
--- 2.5.24/include/linux/page-flags.h~second-chance-throttle	Thu Jul  4 23:18:35 2002
+++ 2.5.24-akpm/include/linux/page-flags.h	Thu Jul  4 23:20:02 2002
@@ -65,6 +65,7 @@
 #define PG_private		12	/* Has something at ->private */
 #define PG_writeback		13	/* Page is under writeback */
 #define PG_nosave		15	/* Used for system suspend/resume */
+#define PG_throttle		16	/* page allocator should throttle */
 /*
  * Global page accounting. One instance per CPU.
  */
@@ -216,6 +217,12 @@ extern void get_page_state(struct page_s
 #define ClearPageNosave(page)		clear_bit(PG_nosave, &(page)->flags)
 #define TestClearPageNosave(page)	test_and_clear_bit(PG_nosave, &(page)->flags)
 
+#define PageThrottle(page)	test_bit(PG_throttle, &(page)->flags)
+#define SetPageThrottle(page)	set_bit(PG_throttle, &(page)->flags)
+#define TestSetPageThrottle(page)	test_and_set_bit(PG_throttle, &(page)->flags)
+#define ClearPageThrottle(page)	clear_bit(PG_throttle, &(page)->flags)
+#define TestClearPageThrottle(page)	test_and_clear_bit(PG_throttle, &(page)->flags)
+
 /*
  * The PageSwapCache predicate doesn't use a PG_flag at this time,
  * but it may again do so one day.
--- 2.5.24/mm/filemap.c~second-chance-throttle	Thu Jul  4 23:20:08 2002
+++ 2.5.24-akpm/mm/filemap.c	Thu Jul  4 23:20:23 2002
@@ -682,6 +682,7 @@ void end_page_writeback(struct page *pag
 {
 	wait_queue_head_t *waitqueue = page_waitqueue(page);
 	smp_mb__before_clear_bit();
+	ClearPageThrottle(page);
 	if (!TestClearPageWriteback(page))
 		BUG();
 	smp_mb__after_clear_bit();

-
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/