From: Hillf Danton <hdanton@sina.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>,
	John Hubbard <jhubbard@nvidia.com>,
	linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Jan Kara <jack@suse.cz>,
	Mel Gorman <mgorman@suse.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Christoph Hellwig <hch@lst.de>,
	Jonathan Corbet <corbet@lwn.net>
Subject: Re: [RFC] mm: gup: add helper page_try_gup_pin(page)
Date: Tue,  5 Nov 2019 12:27:55 +0800
Message-Id: <20191105042755.7292-1-hdanton@sina.com>
In-Reply-To: <20191104102050.15988-1-hdanton@sina.com>
References: <20191103112113.8256-1-hdanton@sina.com> <20191104043420.15648-1-hdanton@sina.com> <20191104102050.15988-1-hdanton@sina.com>


On Mon, 4 Nov 2019 14:03:55 -0500 Jerome Glisse wrote:
>
> On Mon, Nov 04, 2019 at 06:20:50PM +0800, Hillf Danton wrote:
> >
> > On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> > > On 11/3/19 8:34 PM, Hillf Danton wrote:
> > > ...
> > > >>
> > > >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> > > >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> > > >
> > > > Would you please specify the reasoning of tracking multiple gupers
> > > > for a dirty page? Do you mean that it is all fine for guper-A to add
> > > > changes to guper-B's data without warning and vice versa?
> > >
> > > It's generally OK to call get_user_pages() on a page more than once.
> >
> > Does this explain that it's generally OK to gup pin a page under
> > writeback and then start DMA to it behind the flusher's back without
> > warning?
>
> It can happens today, is it ok ... well no but we live in an imperfect
> world. GUP have been abuse by few device driver over the years and those
> never checked what it meant to use it so now we are left with existing
> device driver that we can not break that do wrong thing.

See your point :)

> I personaly think that we should use bounce page for writeback so that
> writeback can still happens if a page is GUPed.

Gup can be prevented from falling foul of writeback IMHO if the page
under writeback, gup pinned or not, remains stable until it is no
longer dirty.

For that stability, either we can check PageWriteback on gup pinning
for instance as the RFC does or drivers can set a gup-pinned page
dirty only after DMA and start no more DMA until it is clean again.

As long as that stability is ensured writeback will no longer need to
take care of gup pin, long-lived or transient.
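
A minimal userspace sketch of the stability rule described above, under
loud assumptions: struct toy_page and every toy_* function are
hypothetical stand-ins invented for illustration, not the kernel's
struct page, not the RFC's page_try_gup_pin(), and not any existing
kernel API. It only models the two ideas: refuse a new pin while the
page is under writeback, and mark the page dirty only after DMA is done.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; not the kernel's definition. */
struct toy_page {
	bool dirty;
	bool writeback;
	int gup_pins;
};

/* Hypothetical helper: refuse a new pin while writeback is in flight. */
static bool toy_page_try_gup_pin(struct toy_page *page)
{
	if (page->writeback)
		return false;
	page->gup_pins++;
	return true;
}

/* Drop a pin; the page is marked dirty only once DMA has finished. */
static void toy_page_gup_unpin(struct toy_page *page, bool wrote)
{
	assert(page->gup_pins > 0);
	if (wrote)
		page->dirty = true;
	page->gup_pins--;
}

/* Flusher side: a dirty page moves to the writeback state. */
static void toy_start_writeback(struct toy_page *page)
{
	if (page->dirty) {
		page->dirty = false;
		page->writeback = true;
	}
}

/* Flusher side: writeback completed, the page is clean again. */
static void toy_end_writeback(struct toy_page *page)
{
	page->writeback = false;
}
```

In this model a guper that loses the race simply gets false back and
has to retry after toy_end_writeback(); what a real driver should do in
that case is left open here.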

That does not seem like a requirement too strict to meet in practice
wrt data corruption, and a bounce page for writeback sounds promising.
Does it need to do a memory copy?

> John's patchset is the
> first step to be able to identify GUPed page and maybe special case them.
> >
> > > And even though we are seeing some work to reduce the number of places
> > > in the kernel that call get_user_pages(), there are still lots of call sites.
> > > That means lots of combinations and situations that could result in more
> > > than one gup call per page.
> > >
> > > Furthermore, there is no mechanism, convention, documentation, nor anything
> > > at all that attempts to enforce "for each page, get_user_pages() may only
> > > be called once."
> >
> > What sense is this making wrt the data corruption resulting specifically
> > from multiple gup references?
>
> Multiple GUP references do not imply corruption. Only one or more devices
> writing to the page while writeback is happening is a cause of corruption.
> Multiple device writting in the same page concurrently is like multiple
> CPU thread doing the same. Either the application/device drivers are doing
> this rightfully on purpose or the application has a bug. Either way it is
> not our problem (note here i am talking about userspace portion of the
> device driver).
>
>
> > > >> I think you must have missed the many contentious debates about the
> > > >> tension between gup-pinned pages, and writeback. File systems can't
> > > >> just ignore writeback in all cases. This patch leads to either
> > > >> system hangs or filesystem corruption, in the presence of long-lasting
> > > >> gup pins.
> > > >
> > > > The current risk of data corruption due to writeback with long-lived
> > > > gup references all ignored is zeroed out by detecting gup-pinned dirty
> > > > pages and skipping them; that may lead to problems you mention above.
> > > >
> > >
> > > Here, I believe you're pointing out that the current situation in the
> > > kernel is already broken, with respect to fs interactions (especially
> > > writeback) with gup. Yes, you are correct, there is a problem.
> > >
> > > > Though I doubt anything helpful about it can be expected from fs in near
> > >
> > > Actually, fs and mm folks are working together to solve this.
> > >
> > > > future, we have options for instance that gupers periodically release
> > > > their references and re-pin pages after data sync the same way as the
> > > > current flusher does.
> > > >
> > >
> > > That's one idea. I don't see it as viable, given the behavior of, say,
> > > a compute process running OpenCL jobs on a GPU that is connected via
> > > a network or Infiniband card--the idea of "pause" really looks more like
> > > "tear down the complicated multi-driver connection, writeback, then set it
> > > all up again", I suspect. (And if we could easily interrupt the job, we'd
> > > probably really be running with a page-fault-capable GPU plus and IB card
> > > that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)
> >
> > Well is it OK to shorten the behavior above to "data corruption in
> > writeback is tolerable in practice because of the expensive cost of
> > data sync"?
> >
> > What is the point of writeback? Why can the writeback of long-lived
> > gup-pinned pages not be skipped while data sync can be entirely
> > ignored?
>
> I do not think we want that (skip writeback on GUPed page). I think what
> we should do is use a bounce page ie take a snapshot of the page and
> starts writeback with the snapshot. We need a snapshot because fs code
> expect stable page content for things like encryption or hashing or other
> crazy fs features :)
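
To make the snapshot idea concrete, here is a rough userspace sketch,
assuming the most naive form in which the snapshot is a plain memcpy
into a separately allocated buffer. TOY_PAGE_SIZE, toy_snapshot_page()
and toy_checksum() are hypothetical names made up for this
illustration; none of them exist in the kernel, and the checksum only
stands in for whatever the fs computes over stable contents.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/*
 * Copy the live page into a freshly allocated bounce buffer.
 * Returns NULL if the allocation fails; the caller frees the result.
 */
static unsigned char *toy_snapshot_page(const unsigned char *live)
{
	unsigned char *bounce = malloc(TOY_PAGE_SIZE);

	if (bounce)
		memcpy(bounce, live, TOY_PAGE_SIZE);
	return bounce;
}

/* Stand-in for fs work (hashing, encryption) that needs stable data. */
static unsigned long toy_checksum(const unsigned char *buf)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
		sum = sum * 31 + buf[i];
	return sum;
}
```

Writeback would then work on the bounce buffer, so a device writing to
the live page during that window cannot change what the fs hashed. The
cost in this naive form is one allocation and one page-sized copy per
pinned page written back.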