From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=QUr6=ZE=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.4 required=3.0 tests=DKIM_SIGNED,DKIM_VALID,
	DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=no autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 1BFC2C43331
	for <linux-mm@archiver.kernel.org>; Tue, 12 Nov 2019 21:05:42 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id B473720659
	for <linux-mm@archiver.kernel.org>; Tue, 12 Nov 2019 21:05:41 +0000 (UTC)
Authentication-Results: mail.kernel.org;
	dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="JJPOu3xz"
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org B473720659
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=redhat.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 4C7496B0003; Tue, 12 Nov 2019 16:05:41 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 49EFE6B0005; Tue, 12 Nov 2019 16:05:41 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 33FC06B0006; Tue, 12 Nov 2019 16:05:41 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0078.hostedemail.com [216.40.44.78])
	by kanga.kvack.org (Postfix) with ESMTP id 140696B0003
	for <linux-mm@kvack.org>; Tue, 12 Nov 2019 16:05:41 -0500 (EST)
Received: from smtpin21.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay05.hostedemail.com (Postfix) with SMTP id C894D181AC9CB
	for <linux-mm@kvack.org>; Tue, 12 Nov 2019 21:05:40 +0000 (UTC)
X-FDA: 76148856840.21.shelf05_6bb944d64fa09
X-HE-Tag: shelf05_6bb944d64fa09
X-Filterd-Recvd-Size: 14453
Received: from us-smtp-1.mimecast.com (us-smtp-delivery-1.mimecast.com [205.139.110.120])
	by imf19.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Tue, 12 Nov 2019 21:05:40 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1573592739;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:autocrypt:autocrypt;
	bh=Ee4l70ypyLVZi4AN9ZWAxCYT/EnMrmPQRqbGlt32S6s=;
	b=JJPOu3xzF4zoWm7sheqfEvLpTjQTqgKJ1lRBfjxjxi3oz6QaWkSqcvFkeztAd3OPIL8S70
	S7f6qt3L+rXepq14haANxk+Thrt8CbfGn2DzbC0JXl2gU7Xt23rfOgZJOeFec+ZLCtLs/J
	iRIW6dB5IQXru6UUSQD+P+mFXRwcrxg=
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-276-l4U0T5diP82BhxXAOVuQtA-1; Tue, 12 Nov 2019 16:05:37 -0500
Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.phx2.redhat.com [10.5.11.14])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mimecast-mx01.redhat.com (Postfix) with ESMTPS id EB9391005513;
	Tue, 12 Nov 2019 21:05:34 +0000 (UTC)
Received: from [10.36.116.48] (ovpn-116-48.ams2.redhat.com [10.36.116.48])
	by smtp.corp.redhat.com (Postfix) with ESMTP id 32A484EDC0;
	Tue, 12 Nov 2019 21:05:24 +0000 (UTC)
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
To: Alexander Duyck <alexander.h.duyck@linux.intel.com>,
 Michal Hocko <mhocko@kernel.org>
Cc: akpm@linux-foundation.org, aarcange@redhat.com, dan.j.williams@intel.com,
 dave.hansen@intel.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
 mgorman@techsingularity.net, mm-commits@vger.kernel.org, mst@redhat.com,
 osalvador@suse.de, pagupta@redhat.com, pbonzini@redhat.com,
 riel@surriel.com, vbabka@suse.cz, wei.w.wang@intel.com, willy@infradead.org,
 yang.zhang.wz@gmail.com, linux-mm@kvack.org
References: <20191106121605.GH8314@dhcp22.suse.cz>
 <CD4A882A-91A7-43F2-B31C-3FFD85289907@redhat.com>
 <dc1b2b3f4db8591303932351971f55688e0e240e.camel@linux.intel.com>
 <20191106165416.GO8314@dhcp22.suse.cz>
 <e90877ab21cdb7bb6a7a71a035f1b03e5544f384.camel@linux.intel.com>
 <f84f53b8-221e-02bd-2e7a-c0040ca03a38@redhat.com>
 <f1f84779123b7d3f0613d20f6bd9b05c806b39f7.camel@linux.intel.com>
 <4cf64ff9-b099-d50a-5c08-9a8f3a2f52bf@redhat.com>
 <d47d796d11d0943aaebc4864767b6ecba22f9a34.camel@linux.intel.com>
 <131f72aa-c4e6-572d-f616-624316b62842@redhat.com>
 <1d881e86ed58511b20883fd0031623fe6cade480.camel@linux.intel.com>
 <8a407188-5dd2-648b-fc26-f03a826bfee3@redhat.com>
 <4be6114f57934eb1478f84fd1358a7fcc547b248.camel@linux.intel.com>
From: David Hildenbrand <david@redhat.com>
Autocrypt: addr=david@redhat.com; prefer-encrypt=mutual; keydata=
 mQINBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ
 dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL
 QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp
 XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK
 Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9
 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt
 WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc
 UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv
 jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb
 B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABtCREYXZpZCBIaWxk
 ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT6JAj4EEwECACgFAljj9eoCGwMFCQlmAYAGCwkI
 BwMCBhUIAgkKCwQWAgMBAh4BAheAAAoJEE3eEPcA/4Na5IIP/3T/FIQMxIfNzZshIq687qgG
 8UbspuE/YSUDdv7r5szYTK6KPTlqN8NAcSfheywbuYD9A4ZeSBWD3/NAVUdrCaRP2IvFyELj
 xoMvfJccbq45BxzgEspg/bVahNbyuBpLBVjVWwRtFCUEXkyazksSv8pdTMAs9IucChvFmmq3
 jJ2vlaz9lYt/lxN246fIVceckPMiUveimngvXZw21VOAhfQ+/sofXF8JCFv2mFcBDoa7eYob
 s0FLpmqFaeNRHAlzMWgSsP80qx5nWWEvRLdKWi533N2vC/EyunN3HcBwVrXH4hxRBMco3jvM
 m8VKLKao9wKj82qSivUnkPIwsAGNPdFoPbgghCQiBjBe6A75Z2xHFrzo7t1jg7nQfIyNC7ez
 MZBJ59sqA9EDMEJPlLNIeJmqslXPjmMFnE7Mby/+335WJYDulsRybN+W5rLT5aMvhC6x6POK
 z55fMNKrMASCzBJum2Fwjf/VnuGRYkhKCqqZ8gJ3OvmR50tInDV2jZ1DQgc3i550T5JDpToh
 dPBxZocIhzg+MBSRDXcJmHOx/7nQm3iQ6iLuwmXsRC6f5FbFefk9EjuTKcLMvBsEx+2DEx0E
 UnmJ4hVg7u1PQ+2Oy+Lh/opK/BDiqlQ8Pz2jiXv5xkECvr/3Sv59hlOCZMOaiLTTjtOIU7Tq
 7ut6OL64oAq+uQINBFXLn5EBEADn1959INH2cwYJv0tsxf5MUCghCj/CA/lc/LMthqQ773ga
 uB9mN+F1rE9cyyXb6jyOGn+GUjMbnq1o121Vm0+neKHUCBtHyseBfDXHA6m4B3mUTWo13nid
 0e4AM71r0DS8+KYh6zvweLX/LL5kQS9GQeT+QNroXcC1NzWbitts6TZ+IrPOwT1hfB4WNC+X
 2n4AzDqp3+ILiVST2DT4VBc11Gz6jijpC/KI5Al8ZDhRwG47LUiuQmt3yqrmN63V9wzaPhC+
 xbwIsNZlLUvuRnmBPkTJwwrFRZvwu5GPHNndBjVpAfaSTOfppyKBTccu2AXJXWAE1Xjh6GOC
 8mlFjZwLxWFqdPHR1n2aPVgoiTLk34LR/bXO+e0GpzFXT7enwyvFFFyAS0Nk1q/7EChPcbRb
 hJqEBpRNZemxmg55zC3GLvgLKd5A09MOM2BrMea+l0FUR+PuTenh2YmnmLRTro6eZ/qYwWkC
 u8FFIw4pT0OUDMyLgi+GI1aMpVogTZJ70FgV0pUAlpmrzk/bLbRkF3TwgucpyPtcpmQtTkWS
 gDS50QG9DR/1As3LLLcNkwJBZzBG6PWbvcOyrwMQUF1nl4SSPV0LLH63+BrrHasfJzxKXzqg
 rW28CTAE2x8qi7e/6M/+XXhrsMYG+uaViM7n2je3qKe7ofum3s4vq7oFCPsOgwARAQABiQIl
 BBgBAgAPBQJVy5+RAhsMBQkJZgGAAAoJEE3eEPcA/4NagOsP/jPoIBb/iXVbM+fmSHOjEshl
 KMwEl/m5iLj3iHnHPVLBUWrXPdS7iQijJA/VLxjnFknhaS60hkUNWexDMxVVP/6lbOrs4bDZ
 NEWDMktAeqJaFtxackPszlcpRVkAs6Msn9tu8hlvB517pyUgvuD7ZS9gGOMmYwFQDyytpepo
 YApVV00P0u3AaE0Cj/o71STqGJKZxcVhPaZ+LR+UCBZOyKfEyq+ZN311VpOJZ1IvTExf+S/5
 lqnciDtbO3I4Wq0ArLX1gs1q1XlXLaVaA3yVqeC8E7kOchDNinD3hJS4OX0e1gdsx/e6COvy
 qNg5aL5n0Kl4fcVqM0LdIhsubVs4eiNCa5XMSYpXmVi3HAuFyg9dN+x8thSwI836FoMASwOl
 C7tHsTjnSGufB+D7F7ZBT61BffNBBIm1KdMxcxqLUVXpBQHHlGkbwI+3Ye+nE6HmZH7IwLwV
 W+Ajl7oYF+jeKaH4DZFtgLYGLtZ1LDwKPjX7VAsa4Yx7S5+EBAaZGxK510MjIx6SGrZWBrrV
 TEvdV00F2MnQoeXKzD7O4WFbL55hhyGgfWTHwZ457iN9SgYi1JLPqWkZB0JRXIEtjd4JEQcx
 +8Umfre0Xt4713VxMygW0PnQt5aSQdMD58jHFxTk092mU+yIHj5LeYgvwSgZN4airXk5yRXl
 SE+xAvmumFBY
Organization: Red Hat GmbH
Message-ID: <cfeeb6e4-ba73-bbf7-0cc3-d0c011d00971@redhat.com>
Date: Tue, 12 Nov 2019 22:05:23 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.1.1
MIME-Version: 1.0
In-Reply-To: <4be6114f57934eb1478f84fd1358a7fcc547b248.camel@linux.intel.com>
Content-Language: en-US
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14
X-MC-Unique: l4U0T5diP82BhxXAOVuQtA-1
X-Mimecast-Spam-Score: 0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 12.11.19 19:34, Alexander Duyck wrote:
> On Tue, 2019-11-12 at 14:04 +0100, David Hildenbrand wrote:
>>>>> fact is it is still invasive, just to different parts of the mm subsy=
stem.
>>>>
>>>> I'd love to see how it uses the page isolation framework, and only has=
 a=20
>>>> single hook to queue pages. I don't like the way pages are pulled out =
of=20
>>>> the buddy in Niteshs approach currently. What you have is cleaner.
>>>
>>> I don't see how you could use the page isolation framework to pull out
>>> free pages. Is there a thread somewhere on the topic that I missed?
>>
>> It's basically only isolating pages while reporting them, and not
>> pulling them out of the buddy (IOW, you move the pages to the isolate
>> queues where nobody is allowed to touch them, and setting the
>> migratetype properly). This e.g., makes other user of page isolation
>> (e.g., memory offlining, alloc_contig_range()) play nicely with these
>> isolated pages. "somebody else just isolated them, please try again."
>=20
> How so? If I understand correctly there isn't anything that prevents you
> from isolating an already isolated page, is there? Last I knew isolated

mm/page_isolation.c:set_migratetype_isolate()
...
if (is_migrate_isolate_page(page))
=09goto out;
...
-> Currently -EBUSY

> pages are still considered "movable" since they are still buddy pages
> aren't they?

They are neither movable nor unmovable AFAIK. They are temporarily
blocked. E.g., memory offlining currently returns -EBUSY if it cannot
isolate the page range. alloc_contig_range() does the same. Imagine
somebody allocating a gigantic page. You certainly cannot move the pages
that are isolated while allocating the page. But you can signal to the
caller to try again later.

>=20
> Also this seems like it would have other implications since isolating a
> page kicks of the memory notifier so as a result a balloon driver would
> then free the pages back out so that they could be isolated with the
> assumption the region is going offline.

Memory notifier? Balloon pages getting freed? No.

The memory notifier is used for onlining/offlining, it is not involved here=
.

I think what you mean is the "isolate notifier", which is only used by
CMM on PPC.

See https://lkml.org/lkml/2019/10/31/487, where I rip that notifier out.

>=20
>> start_isolate_page_range()/undo_isolate_page_range()/test_pages_isolated=
()
>> along with a lockless check if the page is free.
>=20
> Okay, that part I think I get. However doesn't all that logic more or les=
s
> ignore the watermarks? It seems like you could cause an OOM if you don't
> have the necessary checks in place for that.

Any approach that temporarily blocks some free pages from getting
allocated will essentially have this issue, no? I think one main design
point to minimize false OOMs was to limit the number of pages we report
at a time. Or what do you propose here in addition to that?

>=20
>> I think it should be something like this (ignoring different
>> migratetypes and such for now)
>>
>> 1. Test lockless if page is free: Not free? Done.
>=20
> So this should help to reduce the liklihood of races in the steps below.
> However it might also be useful if the code had some other check to see i=
f
> it was done other than just making a pass through the bitmap.

Yes.

>=20
> One thing I had brought up with Nitesh was the idea of maybe doing some
> sort of RCU bitmap type approach. Basically while we hold the zone lock w=
e
> could swap out the old bitmap for a new one. We could probably even keep =
a
> counter at the start of the structure so that we could track how many bit=
s
> are actually set there. Then it becomes less likely of having a race wher=
e
> you free a page and set the bit and the hinting thread tests and clears
> the bit but doesn't see the freed page since it is not synchronized.
> Otherwise your notification setup and reporting thread may need a few smp
> barriers added where necessary.

Yes, swapping out the bitmap via RCU is also be a way to make memory
hotplug work.

I was also thinking about a different bitmap approach. Store for each
section a bitmap. Use a meta bitmap with a bit for each section that
contains pages to report. Sparse zones and memory hot(un)plug would not
be a real issue anymore.

One could go one step further and only have a bitmap with a bit for each
section. Only remember that some (large) page was not reported in that
section (e.g., after buddy merging). In the reporting thread, report all
free pages within that section. You could end up reporting the same page
a couple of times, but the question would be if this is relevant at all.
One would have to prototype and measure that.

Long story short, I am not 100% a fan of the current "bitmap per zone"
approach but is is fairly simple to start with :)

>=20
>> 2. start_isolate_page_range(): Busy? Rare race (with other isolate users
>=20
> Doesn't this have the side effect of draining all the percpu caches in
> order to make certain to flush the pages we isolated from there?

While alloc_contig_range() e.g., calls lru_add_drain_all(), I don't
think isolation will. Where did you spot something like this in
mm/page_isolation.c?

>=20
>> or with an allocation). Done.
>> 3. test_pages_isolated()
>=20
> So I have reviewed the code and I don't see how this could conflict with
> other callers isolating the pages. If anything it seems like if another
> thread has already isolated the pages you would end up getting a false
> positive, reporting the pages, and pulling them back out of isolation.

Isolated pages cannot be isolated. This is tracked via the migratetype.

>=20
>> 3a. no? Rare race, page not free anymore. undo_isolate_page_range()
>=20
> I would hope it is rare. However for something like a max order page I
> could easily see a piece of it having been pulled out. I would think this
> case would be exceedingly expensive since you would have to put back any
> pages you had previous moved into isolation.

I guess it is rare, there is a tiny slot between checking if the page is
free and isolating it. Would have to see that in action.

>=20
>> 3b. yes? Report, then undo_isolate_page_range()
>>
>> If we would run into performance issues with the current page isolation
>> implementation (esp. locking), I think there are some nice
>> cleanups/reworks possible of which all current users could benefit
>> (especially accross pageblocks).
>=20
> To me this feels a lot like what you had for this solution near the start=
.
> Only now instead of placing the pages into an array you are tracking a
> bitmap and then using that bitmap to populate the MIGRATE_ISOLATE lists.

Now we have a clean MM interface to do that :) And yes, which data
structure we're using becomes irrelevant.

>=20
> This sounds far more complex to me then it probably needs to be since jus=
t
> holding the pages with the buddy type cleared should be enough to make
> them temporarily unusable for other threads, and even in your case you ar=
e

If you have a page that is not PageBuddy() and not movable within
ZONE_MOVABLE, has_unmovable_pages() will WARN_ON_ONCE(zone_idx(zone) =3D=3D
ZONE_MOVABLE). This can be triggered via memory offlining, when
isolating the page range.

If your approach does exactly that (clear PageBuddy() on a
ZONE_MOVABLE), it would be a bug. The only safe way is to have the
pageblock(s) isolated.

> still having to use the scatterlist in order to hold the pages and track
> what you will need to undo the isolation later.

I think it is very neat and not complex at all. Page isolation is a nice
feature we have in the kernel. :) It deserves some cleanups, though.

--=20

Thanks,

David / dhildenb