Message-ID: <0ce4d7da7b3672a02c8246caa64bd8a659cc7d04.camel@linux.intel.com>
Subject: Re: + mm-introduce-reported-pages.patch added to -mm tree
From: Alexander Duyck
To: Nitesh Narayan Lal, David Hildenbrand, Michal Hocko
Cc: akpm@linux-foundation.org, aarcange@redhat.com, dan.j.williams@intel.com,
 dave.hansen@intel.com, konrad.wilk@oracle.com, lcapitulino@redhat.com,
 mgorman@techsingularity.net, mm-commits@vger.kernel.org, mst@redhat.com,
 osalvador@suse.de, pagupta@redhat.com, pbonzini@redhat.com, riel@surriel.com,
 vbabka@suse.cz, wei.w.wang@intel.com, willy@infradead.org,
 yang.zhang.wz@gmail.com, linux-mm@kvack.org
Date: Thu, 07 Nov 2019 14:46:33 -0800
References: <20191106121605.GH8314@dhcp22.suse.cz> <20191106165416.GO8314@dhcp22.suse.cz>

On Thu, 2019-11-07 at 14:37 -0500, Nitesh Narayan Lal wrote:
> On 11/7/19 1:02 PM, Alexander Duyck wrote:
> > On Thu, 2019-11-07 at 00:33 +0100, David Hildenbrand wrote:
> > > On 06.11.19 18:48, Alexander Duyck wrote:
> > > > On Wed, 2019-11-06 at 17:54 +0100, Michal Hocko wrote:
> > > > > On Wed 06-11-19 08:35:43, Alexander Duyck wrote:
> > > > > > On Wed, 2019-11-06 at 15:09 +0100, David Hildenbrand wrote:
> > > > > > > > On 06.11.2019 at 13:16, Michal Hocko wrote:
> > > > > > > >
> > > > > > > > I didn't have time to read through newer versions of this patch series
> > > > > > > > but I remember there were concerns about this functionality being pulled
> > > > > > > > into the page allocator previously both by me and Mel [1][2]. Have those been
> > > > > > > > addressed? I do not see an ack from Mel or any other MM people. Is there
> > > > > > > > really a consensus that we want something like that living in the
> > > > > > > > allocator?
> > > > > > > I don't think there is. The discussion is still ongoing (although quiet,
> > > > > > > Nitesh is working on a new version AFAIK). I think we should not rush
> > > > > > > this.
> > > > > > How much time is needed to get a review? I waited 2 weeks since posting
> > > > > > v12 and the only comments I got on the code were from Andrew. Most of this
> > > > > > hasn't changed much since v10, and that was posted back in mid September. I
> > > > > > have been down to making small tweaks here and there and haven't had any
> > > > > > real critiques on the approach since Mel had the comments about conflicts
> > > > > > with compaction, which I addressed by allowing compaction to punt the
> > > > > > reporter out so that it could split and splice the lists as it walked
> > > > > > through them.
> > > > > Well, people are busy and the MM community is not a large one. I cannot
> > > > > really help you much other than keep poking those people and give
> > > > > reasonable arguments so they decide to ack your patch.
> > > > I get that. But v10 was posted in mid September. Back then we had a
> > > > discussion about addressing what Mel had mentioned, and I had mentioned
> > > > then that I had addressed it by allowing compaction to essentially reset
> > > > the reporter to get it out of the list so compaction could do this split
> > > > and splice tumbling logic.
> > > >
> > > > > I definitely do not intend to nack this work, I just have maintainability
> > > > > concerns, and considering there is an alternative approach that does not
> > > > > require touching page allocator internals and which we need to compare
> > > > > against, I do not really think there is any need to push something
> > > > > in right away. Or is there any pressing reason to have this merged right
> > > > > now?
> > > > The alternative approach doesn't touch the page allocator, however it
> > > > still has essentially the same changes to __free_one_page. I suspect the
> > > Nitesh is working on Michal's suggestion to use page isolation instead
> > > AFAIK - which avoids this.
> > Okay. However, it makes it much harder to discuss when we are comparing
> > against code that isn't public. If the design is being redone, do we have
> > any ETA for when we will have something to actually compare to?
>
> If there had been just the design change then giving a definite ETA would
> have been possible.
> However, I also have to fix the performance with (MAX_ORDER - 2). Unlike you,
> I need some time to do that.
> If I just post the code without fixing the performance there will again be an
> unnecessary discussion about the same thing, which doesn't make any sense.

Understood. However, with that being the case I don't think it is really
fair to say that my patch set needs to wait on yours to be completed
before we can review it and decide if it is acceptable for upstream.

With that said, I am working on a v14 to address Mel's review feedback. I
will hopefully have that completed and tested by early next week.

> > > > performance issue seen is mostly due to the fact that because it doesn't
> > > > touch the page allocator it is taking the zone lock and probing the page
> > > > for each set bit to see if the page is still free.
> > > > As such, the performance
> > > > regression seen gets worse the lower the order used for reporting.
> > > >
> > > > Also, I suspect Nitesh's patches are in need of further review. I have
> > > > provided feedback, however my focus ended up being more on the kernel
> > > > panics and 30% performance regression rather than debating architecture.
> > > Please don't take this personally, but I really dislike you talking about
> > > Nitesh's RFCs (!) and pushing for your approach (although it was you that
> > > was late to the party!) in that way. If there are problems then please
> > > collaborate and fix them instead of using the same wrong arguments over and
> > > over again.
> > Since Nitesh is in the middle of doing a full rewrite anyway, I don't have
> > much to compare against except for the previous set, which still needs
> > fixes. It is why I mentioned in the cover of the last patch set that I
> > would prefer to not discuss it, since I have no visibility into the patch
> > set he is now working on.
>
> Fair point.
>
> > > a) hotplug/sparse zones: I explained a couple of times why we can ignore
> > > that. There was never a reply from you, yet you keep coming up with
> > > that. I don't enjoy talking to a wall.
> > This gets to the heart of how Nitesh's patch set works. It is assuming
> > that every zone is linear, that there will be no overlap between zones,
> > and that the zones don't really change. These are key architectural
> > assumptions that should really be discussed instead of simply dismissed.
>
> They are not at all dismissed, they are just kept as a future action item.

Yes, but when we are asked to compare the two solutions, pointing out the
areas that are not complete is a valid point. David complained about
talking to a wall, but what does he expect when he is comparing an RFC
versus something that is ready for acceptance.
He is going to get a
list of things that still need to be completed and other issues that were
identified with the patch set.

> > I guess part of the difference between us is that I am looking for
> > something that is production ready and not a proof of concept. It sounds
> > like you would prefer this work stays in a proof-of-concept stage for some
> > time longer.
>
> In my opinion, it is more about how many use-cases we want to target
> initially.
> With your patch-set, I agree we can cover more use-cases where the solution
> will fit in.
> However, my series might not be suitable for use-cases where
> we have memory hotplug or memory restriction. (This will be the case
> after I fix the issues in the series)

This is the difference between a proof of concept and production code in
my opinion. I have had to go through and cover all the corner cases, make
sure this compiles on architectures other than x86, and try to validate as
much as possible in terms of possible regressions. As such, I may have had
to make performance compromises here and there in order to make sure all
those cases are covered.

> > > b) Locking optimizations: Come on, these are premature optimizations and
> > > nothing to dictate your design. *nobody* but you cares about that in an
> > > initial approach we get upstream. We can always optimize that.
> > My concern isn't so much the locking as the fact that it is a hunt-and-peck
> > approach through a bitmap that will become increasingly stale as
> > you are processing the data. Every bit you have to test for requires
> > taking a zone lock and then probing to see if the page is still free and
> > the right size. My concern is how much time is going to be spent with the
> > zone lock held while other CPUs are waiting on access.
>
> This can be prevented (at least to an extent) by checking if the page is in
> the buddy before acquiring the lock, as I have suggested previously.

Agreed, that should help to reduce the pain somewhat. However, it still
isn't an exact science, since you may find that the page state changes
before you can acquire the zone lock.

> > > c) Kernel panics: Come on, we are talking about complicated RFCs here
> > > with moving design decisions. We want to discuss *design* and
> > > *architecture* here, not *implementation details*.
> > Then why ask me to compare performance against it? You were the one
> > pushing for me to test it, not me. If you and Nitesh knew the design
> > wasn't complete enough to run it, why ask me to test it?
> >
> > Many of the kernel panics for the patch sets in the past have been related
> > to fundamental architectural issues. For example, ignoring things like
> > NUMA, mangling the free_list by accessing it with the wrong locks held,
> > etc.
>
> Obviously we didn't know it earlier; whatever tests I had tried, I didn't see
> any issues with them.
> Again, I am trying to learn from my mistakes and appreciate you helping
> me out with that.

I understand that. However, I have been burned a few times by this now, so I
feel it is valid to point out that there have been ongoing issues such as
this when doing a comparison. Especially when the complaint is that my
approach is "fragile".

> > > d) Performance: We want to see a design that fits into the whole
> > > architecture cleanly, is maintainable, and provides a benefit. Of
> > > course, performance is relevant, but it certainly should not dictate our
> > > design of a *virtualization specific optimization feature*. Performance
> > > is not everything, otherwise please feel free to rewrite the kernel in
> > > ASM and claim it is better because it is faster.
> > I agree performance is not everything. But when a system grinds down to
> > 60% of what it was originally, I find that significant.
>
> 60%? In one of your previous emails you suggested that the drop was 30%.

I was rounding down in both cases by basically just dropping the last
digit to 0. As I recall, it was something in the mid 30s that I was seeing
for a drop. I was being a bit more generous before, but I was feeling a
bit irritable, so I flipped things and rounded down the percentage it
retained.

> > > Again, I do value your review and feedback, but I absolutely do not
> > > enjoy the way you are trying to push your series here, sorry.
> > Well, I am a bit frustrated as I have had to provide a significant amount
> > of feedback on Nitesh's patches, and in spite of that I feel like I am
> > getting nothing in return.
>
> Not sure if I understood the meaning here. May I know what you were expecting?
> I do try to review your series and share whatever I can.

So for all the review feedback I have provided, I feel like I haven't
gotten much back. I have had several iterations of the patch set that got
almost no replies. Then, on top of that, instead of getting suggestions on
how to improve my patch set, what ends up happening at some point is that
your patch set gets brought up and muddies the waters, since we start
discussing the issues with it instead of addressing issues in my patch set.

> > I have pointed out the various issues and
> > opportunities to address the issues. At this point there are sections of
> > his code that are directly copied from mine[1].
>
> I don't think there is any problem with learning from you or your code.
> Is there?

I don't have any problems with you learning from my code, and the fact is
it shows we are making progress, since at this point the virtio-balloon
bits are now essentially identical. The point that I was trying to make is
that I have been contributing to the progress that has been made.

> As far as giving proper credit is concerned, that was my mistake and I intend
> to correct it, as I have already mentioned.

I appreciate that.

> > I have done everything I
> > can to help the patches along, but it seems like they aren't getting out of
> > RFC or proof-of-concept state any time soon.
>
> So one reason for that is definitely the issues you pointed out.
> But also, since I started working on this project, I kept getting different
> design suggestions, which according to me is excellent. However, adopting
> them and making those changes could be easy for you, but I have to take my
> time to properly understand them before implementing them.

Understood. In my case I am suffering from the opposite problem. I am
taking the input and iterating pretty quickly at this point. However, I
haven't been getting much feedback. Thus my frustration when the patches
are applied and then people start wanting them reverted until they can be
tested against your patch set.

> > So with that being the case,
> > why not consider his patch set as something that could end up being a
> > follow-on/refactor instead of an alternative to mine?
> >
>
> I have already mentioned that I would like to see the solution which is
> better and has a consensus (it doesn't matter where it is coming from).

My primary focus is getting a solution in place. I'm not sure if
management has changed much at Red Hat since I was there, but usually there
is a push to get things completed, is there not? In my case I would like to
switch from development to maintenance for memory hinting, and if you end
up with a better approach I would be more than open to refactoring it
later and/or throwing out my code to replace it with yours.

There is a saying: "Perfection is the enemy of progress." My concern is
that we are spending so much time trying to find the perfect solution that
we aren't going to ever come up with one. After all, the concept has been
around since 2011 and we are still debating how to go about implementing
it.