From: Hillf Danton
To: Michal Hocko
Cc: Hillf Danton, linux-mm, Andrew Morton, linux-kernel, Johannes Weiner,
 Shakeel Butt, Minchan Kim, Mel Gorman, Vladimir Davydov, Jan Kara
Subject: Re: [RFC v1] mm: add page preemption
Date: Wed, 23 Oct 2019 19:53:50 +0800
Message-Id: <20191023115350.4956-1-hdanton@sina.com>
In-Reply-To: <20191022142802.14304-1-hdanton@sina.com>

On Wed, 23 Oct 2019 10:17:29 +0200 Michal Hocko wrote:
>
> On Tue 22-10-19 22:28:02, Hillf Danton wrote:
> >
> > On Tue, 22 Oct 2019 14:42:41 +0200 Michal Hocko wrote:
> > >
> > > On Tue 22-10-19 20:14:39, Hillf Danton wrote:
> > > >
> > > > On Mon, 21 Oct 2019 14:27:28 +0200 Michal Hocko wrote:
> > > [...]
> > > > > Why do we care and which workloads would benefit and how much.
> > > >
> > > > Page preemption, disabled by default, should be turned on by those
> > > > who wish the performance of their workloads to survive memory
> > > > pressure to a certain extent.
> > >
> > > I am sorry but this doesn't say anything to me. How come not all
> > > workloads would fit that description?
> >
> > That means pp plays a role when kswapd becomes active, and it may
> > prevent too much jitter in active lru pages.
>
> This is still too vague to be useful in any way.

Page preemption is designed to function only under memory pressure, by
suggesting that kswapd skip deactivating some pages based on a prio
comparison. By design, no page is skipped unless a difference in prio is
found.

That said, no workload can be picked out before prio is updated, so let
users who know that their workloads are sensitive to jitter in lru pages
change the nice value. We are simply adding the pp feature; users are
responsible for turning pp on and changing nice if they feel it necessary.

> > > > The number of pp users is expected to be close to the number of
> > > > people who change the nice value of their apps to -1 or higher at
> > > > least once a week, fewer than the vi users among UK undergraduates.
> > > >
> > > > > And last but not least why the existing infrastructure doesn't help
> > > > > (e.g. if you have clearly defined workloads with different
> > > > > memory consumption requirements then why don't you use memory cgroups to
> > > > > reflect the priority).
> > > >
> > > > Good question:)
> > > >
> > > > Though pp is implemented by preventing any task from reclaiming as many
> > > > pages as possible from other tasks that are higher on priority, it is
> > > > trying to introduce prio into page reclaiming, to add a feature.
> > > >
> > > > Page and memcg are different objects after all; pp is being added at
> > > > the page granularity. It should be an option available in environments
> > > > without memcg enabled.
> > >
> > > So do you actually want to establish LRUs per priority?
> >
> > No, no change other than the prio added for every lru page. LRU per prio
> > is too much to implement.
>
> Well, considering that per page priority is a no go as already pointed
> out by Willy then you do not have other choice right?

No extra choice needs to be sought for the prio introduced into reclaiming,
as nobody is hurt by design unless pp is enabled and prio is updated.

> > > Why using memcgs is not an option?
> >
> > I have a plan to add prio to memcg. As you saw, I sent an RFC before v0
> > with nice added in memcg, and realised a couple of days ago that its
> > dependence on soft limit reclaim is not acceptable.
> >
> > But we can't do that without determining how to define a memcg's prio.
> > What is in mind now is the highest (or lowest) prio of the tasks in a
> > memcg, with a knob offered to userspace.
> >
> > If you like, I want to have a talk about it sometime later.
>
> This doesn't really answer my question.
> Why cannot you use memcgs as they are now.

No prio is provided there.

> Why exactly do you need a fixed priority?

Prio comparison in global reclaim is what was added; the fact that every
task has a prio is what makes that comparison possible.

> > > This is the main facility to partition reclaimable
> > > memory in the first place.

Is every task (pid != 1) contained in a memcg? And why?

> > > You should really focus on explaining on why
> > > a much more fine grained control is needed much more thoroughly.

Which do you prefer, cello or fiddle? And why?

> > > > What is way different from the protections offered by memory cgroup
> > > > is that pages protected by memcg:min/low can't be reclaimed regardless
> > > > of memory pressure. Such a guarantee is not available under pp, as it only
> > > > suggests an extra factor to consider when deactivating lru pages.
> > >
> > > Well, low limit can be breached if there is no eligible memcg to be
> > > reclaimed. That means that you can shape some sort of priority by
> > > setting the low limit already.
> > >
> > > [...]
> > >
> > > > What was added on the reclaimer side is
> > > >
> > > > 1, kswapd sets pgdat->kswapd_prio, the switch between page reclaimer
> > > >    and allocator in terms of prio, to the lowest value before taking
> > > >    a nap.
> > > >
> > > > 2, any allocator is able to wake up the reclaimer because of that
> > > >    lowest prio, and kswapd starts reclaiming pages using the waker's
> > > >    prio.
> > > >
> > > > 3, if an allocator comes while kswapd is active, its prio is checked;
> > > >    this is a no-op if kswapd is higher on prio, otherwise the switch
> > > >    is updated with the higher prio.
> > > >
> > > > 4, every time kswapd raises sc.priority, which starts at DEF_PRIORITY,
> > > >    it checks whether there is a pending update of the switch; kswapd's
> > > >    prio steps up if there is one, so its prio never steps down. Nor is
> > > >    there prio inversion.
> > > >
> > > > 5, goto 1 when kswapd finishes its work.
> > >
> > > What about the direct reclaim?
> >
> > Their prio will not change before reclaiming finishes, so leave it be.
>
> This doesn't answer my question.

There is no prio inversion in direct reclaim, if that is what you mean.

> > > What if pages of a lower priority are
> > > hard to reclaim? Do you want a process of a higher priority stall more
> > > just because it has to wait for those lower priority pages?
> >
> > The problems above are not introduced by pp; let Mr. Kswapd take care of
> > them.
>
> No, this is not an answer.

Is pp making them worse?

Thanks
Hillf
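
P.S. Below is a rough, userspace-only sketch of the kswapd_prio switch
described in points 1-5 above, in case it helps reading. It is not code
from the patch: apart from pgdat->kswapd_prio and sc.priority, every name
and value here (the PRIO_LOWEST sentinel, the helpers, the assumption that
a lower number means higher task prio) is made up for illustration.

/*
 * Toy model of the kswapd_prio switch in points 1-5 above.
 * Only pgdat->kswapd_prio and sc.priority are names taken from the
 * discussion; everything else is invented and may not match the patch.
 */
#include <stdio.h>

#define PRIO_LOWEST	140	/* assumed sentinel: lowest possible task prio */

struct pgdat_model {
	int kswapd_prio;	/* the switch between allocator and kswapd */
};

/* assumes the kernel convention: lower numeric value = higher task prio */
static int prio_higher(int a, int b)
{
	return a < b;
}

/* point 1 (and 5): park the switch at the lowest prio before a nap */
static void kswapd_goes_to_sleep(struct pgdat_model *pgdat)
{
	pgdat->kswapd_prio = PRIO_LOWEST;
}

/*
 * points 2 and 3: an allocator either wakes kswapd or, if kswapd is
 * already active, bumps the switch; a no-op when kswapd is higher.
 */
static void allocator_wakes_kswapd(struct pgdat_model *pgdat, int task_prio)
{
	if (prio_higher(task_prio, pgdat->kswapd_prio))
		pgdat->kswapd_prio = task_prio;
}

/*
 * point 4: at the spot where kswapd would raise sc.priority, pick up
 * any pending update, so kswapd's prio only ever steps up in a round.
 */
static int kswapd_pick_prio(struct pgdat_model *pgdat, int cur_prio)
{
	return prio_higher(pgdat->kswapd_prio, cur_prio) ?
		pgdat->kswapd_prio : cur_prio;
}

int main(void)
{
	struct pgdat_model pgdat;
	int prio;

	kswapd_goes_to_sleep(&pgdat);		/* point 1 */
	allocator_wakes_kswapd(&pgdat, 120);	/* point 2: nice 0 waker */
	prio = pgdat.kswapd_prio;

	allocator_wakes_kswapd(&pgdat, 110);	/* point 3: nice -10 arrives */
	prio = kswapd_pick_prio(&pgdat, prio);	/* point 4 */
	printf("kswapd reclaims at prio %d\n", prio);	/* 110, never back to 120 */

	kswapd_goes_to_sleep(&pgdat);		/* point 5 */
	return 0;
}

The intent of parking the switch at the lowest prio, and of only ever
raising it between naps, is that a later, lower-prio waker can never drag
an in-flight reclaim round back down; that is the no-inversion property
claimed in point 4.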