From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from zps78.corp.google.com (zps78.corp.google.com [172.25.146.78])
 by smtp-out.google.com with ESMTP id mB4MRuYl025959
 for ; Thu, 4 Dec 2008 14:27:56 -0800
Received: from wf-out-1314.google.com (wff29.prod.google.com [10.142.6.29])
 by zps78.corp.google.com with ESMTP id mB4MQuBB007855
 for ; Thu, 4 Dec 2008 14:27:55 -0800
Received: by wf-out-1314.google.com with SMTP id 29so4354554wff.10
 for ; Thu, 04 Dec 2008 14:27:55 -0800 (PST)
MIME-Version: 1.0
In-Reply-To: <492E97FA.5000804@gmail.com>
References: <604427e00811212247k1fe6b63u9efe8cfe37bddfb5@mail.gmail.com>
 <20081126123246.GB23649@wotan.suse.de>
 <492DAA24.8040100@google.com>
 <20081127085554.GD28285@wotan.suse.de>
 <492E6849.6090205@google.com>
 <492E8708.4060601@gmail.com>
 <20081127120330.GM28285@wotan.suse.de>
 <492E90BC.1090208@gmail.com>
 <20081127123926.GN28285@wotan.suse.de>
 <492E97FA.5000804@gmail.com>
Date: Thu, 4 Dec 2008 14:27:54 -0800
Message-ID: <604427e00812041427j7f1c8118p48b1b5b577143703@mail.gmail.com>
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY
From: Ying Han
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT
Sender: owner-linux-mm@kvack.org
Return-Path:
To: =?ISO-8859-1?Q?T=F6r=F6k_Edwin?=
Cc: Nick Piggin, Mike Waychison, Ingo Molnar, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm, David Rientjes, Rohit Seth,
 Hugh Dickins, Peter Zijlstra, "H. Peter Anvin"
List-ID:

I am trying your test program (scalability) in house, but somehow I got
different results from what you saw. I created 8 files, each 1G in size,
on separate drives (to keep disk-seek latency from disturbing the
results). I got these numbers on 2.6.26 without applying the patch. May I
ask how to reproduce the mmap issue you mentioned?
8 CPU

read_worker
1 threads Real time: 101.058262 s (since task start)
2 threads Real time: 50.670456 s (since task start)
4 threads Real time: 25.904657 s (since task start)
8 threads Real time: 20.090677 s (since task start)
--------------------------------------------------------------------------------
mmap_worker
1 threads Real time: 101.340662 s (since task start)
2 threads Real time: 51.484646 s (since task start)
4 threads Real time: 28.414534 s (since task start)
8 threads Real time: 21.785818 s (since task start)

--Ying

On Thu, Nov 27, 2008 at 4:52 AM, Torok Edwin wrote:
> On 2008-11-27 14:39, Nick Piggin wrote:
>> On Thu, Nov 27, 2008 at 02:21:16PM +0200, Torok Edwin wrote:
>>
>>> On 2008-11-27 14:03, Nick Piggin wrote:
>>>
>>>>> Running my testcase shows no significant performance difference. What am
>>>>> I doing wrong?
>>>>>
>>>>
>>>> Software may just be doing a lot of mmap/munmap activity. threads +
>>>> mmap is never going to be pretty because it is always going to involve
>>>> broadcasting tlb flushes to other cores... Software writers shouldn't
>>>> be scared of using processes (possibly with some shared memory).
>>>>
>>>
>>> It would be interesting to compare the performance of a threaded clamd,
>>> and of a clamd that uses multiple processes.
>>> Distributing tasks will be a bit more tricky, since it would need to use
>>> IPC, instead of mutexes and condition variables.
>>>
>>
>> Yes, although you could use PTHREAD_PROCESS_SHARED pthread mutexes on
>> the shared memory I believe (having never tried it myself).
>>
>>
>>>> Actually, a lot of things get faster (like malloc, or file descriptor
>>>> operations) because locks aren't needed.
>>>>
>>>> Despite common perception, processes are actually much *faster* than
>>>> threads when doing common operations like these.
>>>> They are slightly slower
>>>> sometimes with things like creation and exit, or context switching, but
>>>> if you're doing huge numbers of those operations, then it is unlikely
>>>> to be a performance critical app... :)
>>>>
>>>
>>> How about distributing tasks to a set of worked threads, is the overhead
>>> of using IPC instead of
>>> mutexes/cond variables acceptable?
>>>
>>
>> It is really going to depend on a lot of things. What is involved in
>> distributing tasks, how many cores and cache/TLB architecture of the
>> system running on, etc.
>>
>> You want to distribute as much work as possible while touching as
>> little memory as possible, in general.
>>
>> But if you're distributing threads over cores, and shared caches are
>> physically tagged (which I think all x86 CPUs are), then you should
>> be able to have multiple processes operate on shared memory just as
>> efficiently as multiple threads I think.
>>
>> And then you also get the advantages of reduced contention on other
>> shared locks and resources.
>>
>
> Thanks for the tips, but lets get back to the original question:
> why don't I see any performance improvement with the fault-retry patches?
>
> My testcase only compares reads file with mmap, vs. reading files with
> read, with different number of threads.
> Leaving aside other reasons why mmap is slower, there should be some
> speedup by running 4 threads vs 1 thread, but:
>
> 1 thread: read:27,18 28.76
> 1 thread: mmap: 25.45, 25.24
> 2 thread: read: 16.03, 15.66
> 2 thread: mmap: 22.20, 20.99
> 4 thread: read: 9.15, 9.12
> 4 thread: mmap: 20.38, 20.47
>
> The speed of 4 threads is about the same as for 2 threads with mmap, yet
> with read it scales nicely.
> And the patch doesn't seem to improve scalability.
> How can I find out if the patch works as expected? [i.e.
> verify that faults are actually retried, and that they don't keep the
> semaphore locked]
>
>> OK, I'll see if I can find them (am overseas at the moment, and I suspect
>> they are stranded on some stationary rust back home, but I might be able
>> to find them on the web).
>
> Ok.
>
> Best regards,
> --Edwin
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
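
For reference, the idea Nick raises in the quoted thread (PTHREAD_PROCESS_SHARED pthread mutexes on shared memory, which he says he has never tried himself) might look roughly like the sketch below. It is not taken from clamd or from the test program discussed here. The struct shared_state layout and the run_shared_counter() helper are hypothetical names made up for illustration, and the sketch assumes Linux with MAP_ANONYMOUS available. Each forked worker bumps one counter under the shared mutex, so distinct processes coordinate the way threads would with an ordinary mutex.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout for this sketch: a mutex and the counter it guards,
 * both placed in memory that is shared across fork(). */
struct shared_state {
    pthread_mutex_t lock;
    long counter;
};

/* Forks nprocs workers; each adds 1 to the shared counter iters times while
 * holding the process-shared mutex. Returns the final counter value, or -1
 * if the shared mapping or the mutex could not be set up. */
long run_shared_counter(int nprocs, long iters)
{
    assert(nprocs > 0 && iters >= 0);

    /* MAP_SHARED | MAP_ANONYMOUS: the mapping stays shared with children. */
    struct shared_state *st = mmap(NULL, sizeof(*st), PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;

    /* Mark the mutex as usable from more than one process. */
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(st, sizeof(*st));
        return -1;
    }
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&st->lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(st, sizeof(*st));
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    st->counter = 0;

    int started = 0;
    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: contend for the shared mutex like a worker thread would. */
            for (long j = 0; j < iters; j++) {
                pthread_mutex_lock(&st->lock);
                st->counter++;
                pthread_mutex_unlock(&st->lock);
            }
            _exit(0);
        }
        if (pid > 0)
            started++;
    }

    /* Parent: reap every child that was actually started. */
    while (started-- > 0)
        wait(NULL);

    long result = st->counter;
    pthread_mutex_destroy(&st->lock);
    munmap(st, sizeof(*st));
    return result;
}
```

Calling run_shared_counter(4, 100000) from a main() should yield 400000 when all four forks succeed; older glibc versions need -pthread at link time. Whether this beats threads for a workload like clamd would still depend on the factors Nick lists (what is involved in distributing tasks, core count, cache/TLB architecture), so it is a starting point for measurement rather than a recommendation.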