From: Hillf Danton <hdanton@sina.com>
To: John Hubbard <jhubbard@nvidia.com>
Cc: Hillf Danton <hdanton@sina.com>,
	linux-mm <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Jan Kara <jack@suse.cz>,
	Mel Gorman <mgorman@suse.de>,
	Jerome Glisse <jglisse@redhat.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Ira Weiny <ira.weiny@intel.com>,
	Christoph Hellwig <hch@lst.de>,
	Jonathan Corbet <corbet@lwn.net>
Subject: Re: [RFC] mm: gup: add helper page_try_gup_pin(page)
Date: Mon,  4 Nov 2019 18:20:50 +0800
Message-Id: <20191104102050.15988-1-hdanton@sina.com>
In-Reply-To: <20191104043420.15648-1-hdanton@sina.com>
References: <20191103112113.8256-1-hdanton@sina.com> <20191104043420.15648-1-hdanton@sina.com>
List-ID: <linux-mm.kvack.org>


On Sun, 3 Nov 2019 22:09:03 -0800 John Hubbard wrote:
> On 11/3/19 8:34 PM, Hillf Danton wrote:
> ...
> >>
> >> Well, as long as we're counting bits, I've taken 21 bits (!) to track
> >> "gupers". :)  More accurately, I'm sharing 31 bits with get_page()...please
> >
> > Would you please specify the reasoning of tracking multiple gupers
> > for a dirty page? Do you mean that it is all fine for guper-A to add
> > changes to guper-B's data without warning and vice versa?
>
> It's generally OK to call get_user_pages() on a page more than once.

Does that make it generally OK to gup-pin a page under writeback and
then start DMA to it behind the flusher's back, without warning?

> And even though we are seeing some work to reduce the number of places
> in the kernel that call get_user_pages(), there are still lots of call sites.
> That means lots of combinations and situations that could result in more
> than one gup call per page.
>
> Furthermore, there is no mechanism, convention, documentation, nor anything
> at all that attempts to enforce "for each page, get_user_pages() may only
> be called once."

How does that bear on the data corruption that results specifically
from multiple gup references?

>
> ...
> >>
> >> I think you must have missed the many contentious debates about the
> >> tension between gup-pinned pages, and writeback. File systems can't
> >> just ignore writeback in all cases. This patch leads to either
> >> system hangs or filesystem corruption, in the presence of long-lasting
> >> gup pins.
> >
> > The current risk of data corruption due to writeback with long-lived
> > gup references all ignored is zeroed out by detecting gup-pinned dirty
> > pages and skipping them; that may lead to problems you mention above.
> >
>
> Here, I believe you're pointing out that the current situation in the
> kernel is already broken, with respect to fs interactions (especially
> writeback) with gup. Yes, you are correct, there is a problem.
>
> > Though I doubt anything helpful about it can be expected from fs in near
>
> Actually, fs and mm folks are working together to solve this.
>
> > future, we have options for instance that gupers periodically release
> > their references and re-pin pages after data sync the same way as the
> > current flusher does.
> >
>
> That's one idea. I don't see it as viable, given the behavior of, say,
> a compute process running OpenCL jobs on a GPU that is connected via
> a network or Infiniband card--the idea of "pause" really looks more like
> "tear down the complicated multi-driver connection, writeback, then set it
> all up again", I suspect. (And if we could easily interrupt the job, we'd
> probably really be running with a page-fault-capable GPU plus an IB card
> that does ODP, plus HMM, and we wouldn't need to gup-pin anyway...)

Well, is it OK to summarize the above as "data corruption in writeback
is tolerable in practice because data sync is too expensive"?

What is the point of writeback? Why can the writeback of long-lived
gup-pinned pages not be skipped while data sync can be entirely
ignored?
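
To make the question concrete, here is a rough userspace sketch of
"detect gup-pinned dirty pages and skip them". It is only an
illustration under stated assumptions: the toy_* names and PIN_BIAS are
hypothetical, and none of this is the kernel's API or the code in the
RFC. The 10-bit bias is chosen only so that 21 of the 31 shared bits
are left for pins, matching the figure quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical bias: each gup pin adds PIN_BIAS to the shared count, so
 * the low 10 bits hold plain references and the remaining 21 bits of a
 * 31-bit count are available for pins.
 */
#define PIN_BIAS (1u << 10)

/* Toy stand-in for a page; not struct page. */
struct toy_page {
	unsigned int refcount;	/* shared by plain refs and pins */
	bool dirty;
};

static void toy_get_page(struct toy_page *p)
{
	p->refcount += 1;
}

static void toy_pin_page(struct toy_page *p)
{
	p->refcount += PIN_BIAS;
}

static void toy_unpin_page(struct toy_page *p)
{
	assert(p->refcount >= PIN_BIAS);	/* catch unbalanced unpin */
	p->refcount -= PIN_BIAS;
}

/*
 * Heuristic only: PIN_BIAS or more plain references look the same as
 * one pin, so false positives are possible.
 */
static bool toy_page_maybe_pinned(const struct toy_page *p)
{
	return p->refcount >= PIN_BIAS;
}

/*
 * Toy flusher pass: clean every dirty page that is not (maybe) pinned.
 * Skipped pages stay dirty. Returns the number of pages cleaned.
 */
static int toy_writeback(struct toy_page *pages, int n)
{
	int written = 0;

	for (int i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		if (toy_page_maybe_pinned(&pages[i]))
			continue;	/* skip: DMA may still be in flight */
		pages[i].dirty = false;
		written++;
	}
	return written;
}
```

In this toy, a pinned dirty page is never written behind the pinner's
back, but it also stays dirty for as long as the pin lasts, which is
the long-lived-pin concern raised above.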

> Anyway, this is not amenable to quick fixes, because the problem is
> a couple of missing design pieces. Which we're working on putting in.
> But meanwhile, smaller changes such as this one are just going to move
> the problems to different places, rather than solving them. So it's best
> not to do that.
>
> thanks,
> --
> John Hubbard
> NVIDIA