Date: Wed, 29 Apr 2020 08:02:32 +1000
From: Dave Chinner
To: "Darrick J. Wong"
Cc: Matthew Wilcox, Ruan Shiyang, "linux-kernel@vger.kernel.org",
    "linux-xfs@vger.kernel.org", "linux-nvdimm@lists.01.org",
    "linux-mm@kvack.org", "linux-fsdevel@vger.kernel.org",
    "dan.j.williams@intel.com", "hch@lst.de", "rgoldwyn@suse.de",
    "Qi, Fuli", "Gotou, Yasunori"
Subject: Re: 回复: Re: [RFC PATCH 0/8] dax: Add a dax-rmap tree to support reflink
Message-ID: <20200428220232.GI2040@dread.disaster.area>
References: <20200427084750.136031-1-ruansy.fnst@cn.fujitsu.com>
    <20200427122836.GD29705@bombadil.infradead.org>
    <20200428064318.GG2040@dread.disaster.area>
    <259fe633-e1ff-b279-cd8c-1a81eaa40941@cn.fujitsu.com>
    <20200428111636.GK29705@bombadil.infradead.org>
    <20200428112441.GH2040@dread.disaster.area>
    <20200428153732.GZ6742@magnolia>
In-Reply-To: <20200428153732.GZ6742@magnolia>
User-Agent: Mutt/1.10.1 (2018-07-13)
List-ID: linux-mm@kvack.org

On Tue, Apr 28, 2020 at 08:37:32AM -0700,
Darrick J. Wong wrote:
> On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote:
> > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote:
> > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote:
> > > > On 2020/4/28 2:43 PM, Dave Chinner wrote:
> > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
> > > > > > On 2020/4/27 20:28:36, "Matthew Wilcox" wrote:
> > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> > > > > > > >  This patchset is a try to resolve the shared 'page cache' problem for
> > > > > > > >  fsdax.
> > > > > > > >
> > > > > > > >  In order to track multiple mappings and indexes on one page, I
> > > > > > > >  introduced a dax-rmap rb-tree to manage the relationship.  A dax entry
> > > > > > > >  will be associated more than once if it is shared.  At the second time we
> > > > > > > >  associate this entry, we create this rb-tree and store its root in
> > > > > > > >  page->private (not used in fsdax).  Insert (->mapping, ->index) when
> > > > > > > >  dax_associate_entry() and delete it when dax_disassociate_entry().
> > > > > > >
> > > > > > > Do we really want to track all of this on a per-page basis?  I would
> > > > > > > have thought a per-extent basis was more useful.  Essentially, create
> > > > > > > a new address_space for each shared extent.  Per page just seems like
> > > > > > > a huge overhead.
> > > > > > >
> > > > > > Per-extent tracking is a nice idea for me.  I haven't thought of it
> > > > > > yet...
> > > > > >
> > > > > > But the extent info is maintained by filesystem.  I think we need a way
> > > > > > to obtain this info from FS when associating a page.  May be a bit
> > > > > > complicated.  Let me think about it...
> > > > >
> > > > > That's why I want the -user of this association- to do a filesystem
> > > > > callout instead of keeping its own naive tracking infrastructure.
> > > > > The filesystem can do an efficient, on-demand reverse mapping lookup
> > > > > from its own extent tracking infrastructure, and there's zero
> > > > > runtime overhead when there are no errors present.
> > > > >
> > > > > At the moment, this "dax association" is used to "report" a storage
> > > > > media error directly to userspace. I say "report" because what it
> > > > > does is kill userspace processes dead. The storage media error
> > > > > actually needs to be reported to the owner of the storage media,
> > > > > which in the case of FS-DAX is the filesystem.
> > > >
> > > > Understood.
> > > >
> > > > BTW, this is the usage in memory-failure, so what about rmap?  I have not
> > > > found how to use this tracking in rmap.  Do you have any ideas?
> > > >
> > > > >
> > > > > That way the filesystem can then look up all the owners of that bad
> > > > > media range (i.e. the filesystem block it corresponds to) and take
> > > > > appropriate action. e.g.
> > > >
> > > > I tried writing a function to look up all the owners' info of one block in
> > > > xfs for memory-failure use.  It was dropped in this patchset because I found
> > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs.  But
> > > > by default, rmapbt is disabled.  I am not sure if it matters...
> > >
> > > I'm pretty sure you can't have shared extents on an XFS filesystem if you
> > > _don't_ have the rmapbt feature enabled.  I mean, that's why it exists.
> >
> > You're confusing reflink with rmap. :)
> >
> > rmapbt does all the reverse mapping tracking, reflink just does the
> > shared data extent tracking.
> >
> > But given that anyone who wants to use DAX with reflink is going to
> > have to mkfs their filesystem anyway (to turn on reflink) requiring
> > that rmapbt is also turned on is not a big deal. Especially as we
> > can check it at mount time in the kernel...
>
> Are we going to turn on rmap by default?
> The last I checked, it did have a 10-20% performance cost on extreme
> metadata-heavy workloads.
> Or do we only enable it by default if mkfs detects a pmem device?

Just have the kernel refuse to mount a reflink enabled filesystem on
a DAX capable device unless -o dax=never or rmapbt is enabled.

That'll get the message across pretty quickly....

> (Admittedly, most people do not run fsx as a productivity app; the
> normal hit is usually 3-5% which might not be such a big deal since you
> also get (half of) online fsck. :P)

I have not noticed the overhead at all on any of my production
machines since I enabled it on all of them way back when....

And, really, pmem is a _very poor choice_ for metadata intensive
applications on XFS as pmem is completely synchronous. XFS has an
async IO model for its metadata that *must* be buffered (so no
DAX!) and the synchronous nature of pmem completely defeats the
architectural IO pipelining XFS uses to allow thousands of
concurrent metadata IOs in flight. OTOH, pmem IO depth is limited to
the number of CPUs that are concurrently issuing IO, so it really,
really sucks compared to a handful of high end nvme SSDs on PCIe
4.0....

So with that in mind, I see little reason to care about the small
additional overhead of rmapbt on FS-DAX installations that require
reflink...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
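
[Illustrative sketch, not part of the original message.] The mount-time
rule proposed above (refuse reflink on a DAX capable device unless
-o dax=never is given or rmapbt is enabled) could look roughly like the
following. This is a standalone userspace sketch under stated
assumptions: struct fs_mount_info, enum dax_mode and
check_dax_reflink_policy() are hypothetical names invented for
illustration and are not the real XFS mount structures or functions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the -o dax= mount option. */
enum dax_mode { DAX_MODE_DEFAULT, DAX_MODE_ALWAYS, DAX_MODE_NEVER };

/* Hypothetical summary of what the kernel knows at mount time. */
struct fs_mount_info {
	bool has_reflink;      /* filesystem was made with reflink */
	bool has_rmapbt;       /* filesystem was made with rmapbt */
	bool dev_dax_capable;  /* backing device can do DAX */
	enum dax_mode dax;     /* value of -o dax= */
};

/*
 * Return 0 if the mount may proceed, -EINVAL if it should be refused.
 * Only the combination "reflink + DAX capable device + DAX not
 * disabled + no rmapbt" is rejected.
 */
int check_dax_reflink_policy(const struct fs_mount_info *mi)
{
	if (!mi->has_reflink || !mi->dev_dax_capable)
		return 0;	/* rule does not apply */
	if (mi->dax == DAX_MODE_NEVER)
		return 0;	/* admin explicitly turned DAX off */
	if (mi->has_rmapbt)
		return 0;	/* reverse mapping lookups are available */
	return -EINVAL;
}
```

A caller would fill in struct fs_mount_info from the superblock
features and mount options and fail the mount on a non-zero return;
the choice of -EINVAL here is an assumption, not something stated in
the thread.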