Message-ID: <324a154d050d44d3b8a85a3e08a64fe4bd75b72c.camel@linux.intel.com>
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
From: Alexander Duyck
To: David Hildenbrand, Michal Hocko
Cc: akpm@linux-foundation.org, aarcange@redhat.com,
 dan.j.williams@intel.com, dave.hansen@intel.com, konrad.wilk@oracle.com,
 lcapitulino@redhat.com, mgorman@techsingularity.net,
 mm-commits@vger.kernel.org, mst@redhat.com, osalvador@suse.de,
 pagupta@redhat.com, pbonzini@redhat.com, riel@surriel.com, vbabka@suse.cz,
 wei.w.wang@intel.com, willy@infradead.org, yang.zhang.wz@gmail.com,
 linux-mm@kvack.org
Date: Tue, 12 Nov 2019 16:31:23 -0800
References: <20191106121605.GH8314@dhcp22.suse.cz>
 <20191106165416.GO8314@dhcp22.suse.cz>
 <4cf64ff9-b099-d50a-5c08-9a8f3a2f52bf@redhat.com>
 <131f72aa-c4e6-572d-f616-624316b62842@redhat.com>
 <1d881e86ed58511b20883fd0031623fe6cade480.camel@linux.intel.com>
 <8a407188-5dd2-648b-fc26-f03a826bfee3@redhat.com>
 <4be6114f57934eb1478f84fd1358a7fcc547b248.camel@linux.intel.com>

On Wed, 2019-11-13 at 00:10 +0100, David Hildenbrand wrote:
> > > > > start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated()
> > > > > along with a lockless check if the page is free.
> > > >
> > > > Okay, that part I think I get.
> > > > However doesn't all that logic more or less
> > > > ignore the watermarks? It seems like you could cause an OOM if you don't
> > > > have the necessary checks in place for that.
> > >
> > > Any approach that temporarily blocks some free pages from getting
> > > allocated will essentially have this issue, no? I think one main design
> > > point to minimize false OOMs was to limit the number of pages we report
> > > at a time. Or what do you propose here in addition to that?
> >
> > If you take a look at __isolate_free_page it was performing a check to see
> > if pulling the page would place us below the minimum watermark for pages.
> > Odds are you should probably look at somehow incorporating that into the
> > solution before you pull the page. I have updated my approach to check for
>
> Ah, now I see what you mean. Makes sense!
>
> > the low watermark with the full capacity of MAX_ORDER - 1 pages before I
> > start reporting, and then I am using __isolate_free_page which will check
> > the minimum watermark to make sure I don't cross that.
>
> Yeah, you probably want to check the watermark before doing any
> reporting - I assume.
>
> > > > > I think it should be something like this (ignoring different
> > > > > migratetypes and such for now)
> > > > >
> > > > > 1. Test lockless if page is free: Not free? Done.
> > > >
> > > > So this should help to reduce the likelihood of races in the steps below.
> > > > However it might also be useful if the code had some other check to see if
> > > > it was done other than just making a pass through the bitmap.
> > >
> > > Yes.
> > >
> > > > One thing I had brought up with Nitesh was the idea of maybe doing some
> > > > sort of RCU bitmap type approach. Basically while we hold the zone lock we
> > > > could swap out the old bitmap for a new one. We could probably even keep a
> > > > counter at the start of the structure so that we could track how many bits
> > > > are actually set there.
> > > > Then it becomes less likely of having a race where
> > > > you free a page and set the bit and the hinting thread tests and clears
> > > > the bit but doesn't see the freed page since it is not synchronized.
> > > > Otherwise your notification setup and reporting thread may need a few smp
> > > > barriers added where necessary.
> > >
> > > Yes, swapping out the bitmap via RCU is also a way to make memory
> > > hotplug work.
> > >
> > > I was also thinking about a different bitmap approach. Store for each
> > > section a bitmap. Use a meta bitmap with a bit for each section that
> > > contains pages to report. Sparse zones and memory hot(un)plug would not
> > > be a real issue anymore.
> >
> > I had thought about that too. The only problem is that the section has to
> > be power of 2 sized and I don't know if we want to be increasing the size
>
> ... are there sections that are not a power of 2? x86_64: 128MB, s390x:
> 256MB, ...

No, what I meant was the mem_section structure. It has a hard requirement
about being power of 2 aligned and is already 16 bytes, or 32 with page
extensions enabled. There is room for a pad in the page extension case so
maybe you could squeeze in something there.

> It does not really make sense to have sections that are not a power of
> two, thinking about page tables ... I would really be interested where
> something like that is possible.

Sorry for the confusion on that.

> > 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
> > > >
> > > > Doesn't this have the side effect of draining all the percpu caches in
> > > > order to make certain to flush the pages we isolated from there?
> > >
> > > While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
> > > think isolation will. Where did you spot something like this in
> > > mm/page_isolation.c?
> >
> > On the end of set_migratetype_isolate(). The last thing it does is call
> > drain_all_pages.
>
> Ahh, missed that, thanks.
> Yeah, one could probably make that
> configurable, because for that use case, where we already expect a free
> page, we don't need that.

I suppose, but that gets back into adding complexity, as we now have to
special-case isolation to work with page reporting.

> > pages in the region are isolated since as you pointed out you get an EBUSY
> > when you attempt to isolate a page that is already isolated and as such
> > removal will fail won't it?
>
> Right now, yes.
>
> (we should rework that code either way to return -EAGAIN in that case
> and let memory offlining try again automatically. But we have to rework
> the -EAGAIN vs. -EBUSY handling in memory offlining code at one point
> either way, I discussed that partially with Michal recently. There is a
> lot of cleaning up to do.)

So it sounds like a cleanup/rewrite of some of the isolation code will be
needed to really get it doing what you want.

Actually I wonder if we couldn't look at something like the
free_reported_page function I did and instead split it up so that it could
be used as an inverse of __isolate_free_page. Maybe something like a
__free_isolated_page.

> > > > still having to use the scatterlist in order to hold the pages and track
> > > > what you will need to undo the isolation later.
> > >
> > > I think it is very neat and not complex at all. Page isolation is a nice
> > > feature we have in the kernel. :) It deserves some cleanups, though.
> >
> > We can agree to disagree. At this point you are talking about adding bits
> > for sections and pages, and in the meantime I am working with zones and
> > pages. I believe finding free space in the section may be much more tricky
> > than finding it in the zone or page has been. Now that I am rid of the
> > list manipulators my approach may soon surpass the bitmap one in terms of
> > being less intrusive/complex..
> > :-)
>
> I am definitely interested to see that approach :) Good to see that the
> whole discussion in this big thread turned out to be productive.

Yeah, when I started working on the patch split Mel wanted I kind of
realized I was optimizing for the shuffle case which really shouldn't be an
optimization target. I think I just got focused on it as it was in the way
of some of the initial changes I needed to make to handle the notifier.
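
To make the watermark ordering discussed above a bit more concrete, here is
a minimal userspace sketch in C. It is not kernel code and not the patch
itself: struct toy_zone, the toy_* helpers, and TOY_REPORT_ORDER are
hypothetical names made up for illustration, and the comparisons are
simplified stand-ins for the real watermark checks. It also assumes that
"full capacity" means the number of MAX_ORDER - 1 blocks one reporting pass
can hold. toy_free_isolated_page only mirrors the idea of an inverse to
__isolate_free_page floated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few zone fields this sketch needs. */
struct toy_zone {
	unsigned long free_pages;	/* pages currently free in the zone */
	unsigned long min_wmark;	/* minimum watermark, in pages */
	unsigned long low_wmark;	/* low watermark, in pages */
};

/* Assumption: MAX_ORDER is 11, so reported blocks are order 10. */
#define TOY_REPORT_ORDER 10

/*
 * Gate for starting a reporting pass: only start if pulling a full batch
 * of `capacity` blocks would still leave the zone at or above the low
 * watermark.
 */
static bool toy_can_start_reporting(const struct toy_zone *z,
				    unsigned int capacity)
{
	unsigned long need = (unsigned long)capacity << TOY_REPORT_ORDER;

	return z->free_pages >= z->low_wmark + need;
}

/*
 * Pull one block of the given order out of the free pool, but refuse if
 * that would take the zone below the minimum watermark.
 */
static bool toy_isolate_free_page(struct toy_zone *z, unsigned int order)
{
	unsigned long nr = 1UL << order;

	assert(order <= TOY_REPORT_ORDER);
	if (z->free_pages < nr || z->free_pages - nr < z->min_wmark)
		return false;

	z->free_pages -= nr;
	return true;
}

/* Inverse of toy_isolate_free_page(): hand the block back to the pool. */
static void toy_free_isolated_page(struct toy_zone *z, unsigned int order)
{
	assert(order <= TOY_REPORT_ORDER);
	z->free_pages += 1UL << order;
}
```

The point of the two-level check is that the low-watermark gate is applied
once for the whole batch, while the minimum-watermark check is repeated per
block, so a pass that races with allocations still stops before it can push
the zone under the minimum.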