From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Mon, 30 Jul 2007 06:30:08 +0200 From: Nick Piggin Subject: Re: [patch][rfc] remove ZERO_PAGE? Message-ID: <20070730043008.GB7222@wotan.suse.de> References: <20070727021943.GD13939@wotan.suse.de> <20070727055406.GA22581@wotan.suse.de> <20070730030806.GA17367@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org Return-Path: To: Linus Torvalds Cc: Andrew Morton , Hugh Dickins , Andrea Arcangeli , Linux Memory Management List List-ID: On Sun, Jul 29, 2007 at 08:45:25PM -0700, Linus Torvalds wrote: > > > On Mon, 30 Jul 2007, Nick Piggin wrote: > > > > Well the issue wasn't exactly that, but the fact that a lot of processes > > all exitted at once, while each having a significant number of ZERO_PAGE > > mappings. The freeing rate ends up going right down (OK it wasn't quite a > > livelock, finishing in > 2 hours, but without ZERO_PAGE bouncing they > > exit in 5 seconds). > > Umm. Isn't this because of the new page->mapping counting for reserved > pages? > > The one I was violently against, and told you (and Hugh) was pointless and > bad? > > Or what "bouncing" are you talking about? page->mapcount, yes. I don't quite remember anybody being violently against it, but that's in the past now anyway. > In other words, this doesn't really sound like a ZERO_PAGE problem at all, > but a problem that was engineered by unnecessarily trying to count those > pages in the first place. No? Yeah, but that's a little unfair: it was engineered by removing the code to *not* count those pages :) Another option I gave you was to add back some of that code to avoid this refcounting, but you were violently against that :). > > > Kernel builds with/without this? If the page faults really are that big a > > > deal, this should all be visible. > > > > Sorry if it was misleading: the kernel build numbers weren't really about > > where ZERO_PAGE hurts us, but just trying to show that it doesn't help too > > much (for which I realise anything short of a kernel.org release is sadly > > inadequate). > > > > Anyway, I'll see if I can get anything significant... > > The thing that really riles me up about this is that the whole damn thing > seems to be so pointless. This "ZERO_PAGE is bad" thing has been a > constant background noise, where people are pushing their opinions with no > real technical reasons. > > I want technical reasons, but I get the feeling that this pogrom is abotu > anything but technical arguments. > > So I _really_ don't want to hear you blaming ZERO_PAGE for something that > you introduced yourself with Hugh, and that had _zero_ to do with > ZERO_PAGE, and that I spent weeks saying was pointless, to the point where > I just gave up. > > Now, that you apparently have found the perfect load that proved me right, > you instead of blaming the pointless refcounting, you want to blame the > poor ZERO_PAGE. Again. No, I'm not saying ZERO_PAGE is bad because it has a refcounting scalability problem. We can avoid or eliminate that by *not refcounting it* or doing something crazy like per-node ZERO_PAGEs. My first patch to fix the problem was actually to not refcount it. Remember? The technical argument for this patch is that I'm trying to reason that ZERO_PAGE is no good in the first place. In this debate, the refcounting issue is really moot (IOW, I'm not trying to argue the ZERO_PAGE is bad *because* of the refcounting problem, but because it is not good). > And THAT is what makes me irritated with this patch. I hate the > background, and what looks like intellectual dishonesty. I _told_ people > that refcounting reserved pages was pointless and bad. Did you listen? No. > And now that it causes problems, rather than blame the refcounting, you > blame the victim. > > The zero page is *cheaper* to set up than normal pages. It always was. You > just *made* it more expensive, because of the irrational fear of > PageReserved() that protected us from all those unnecessary bounces in the > first place. > > So a totally equivalent fix would be to just re-instate the PageReserved > checks. It would likely even be easier these days (one logical place to do > so would be in "vm_normal_page()", which automatically would catch all > unmapping cases, but you'd still have to make sure you don't *increment* > the page count when you map it too, of course). I didn't like PageReserved for a number of reasons including that it allowed people to be sloppy with refcounting. However I wouldn't mind adding back some checks to skip counting (although if we do that, let's add a new page flag for it, and PageReserved can eventually disappear). But we should do it on the right level. That is, the struct page itself is a refcounted object. If we skip refcounting for userspace mappings, it should be done there, rather than an ugly hack in put_page. The fear of PageReserved was not irrational -- as I said, I needed to get rid of the put_page special casing for the lockless pagecache. I'm not against properly special casing special pages. > But if you can actually show that ZERO_PAGE literally slows things down > (and none of this page count bouncing crud that you were the one that > introduced in the first place), then _that_ would be a totally different > issue. At that point, you have an independent reason for removing code > that has basically been there since day 1, and all my arguments go away. > > See? > > I'd love to hear "here's a real-life load, and yes, the ZERO_PAGE logic > really does hurt more than it helps, it's time to remove it". At that > point I'll happily apply the patch. > > But what I *don't* want to hear is "we screwed up the reference-counting > of ZERO_PAGE, so now it's so expensive that we want to remove the page > entirely". That just makes me sad. And a bit angry. I'd say it will be hard to actually get a significant real world improvement from removing it vs de-refcounting it. Maybe on an mmap_sem constrained workload, but that seems to be a lot better after the glibc fix and private futexes... But is there a good reason to keep it? You say it is cheaper to set up, but make -j still does 500 extra page faults per second per cpu because of ZERO_PAGE, making it effectively more expensive even ignoring the refcounting problem. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org