Date: Thu, 14 Jan 2021 09:20:18 -0800
From: "Darrick J. Wong"
Wong" To: zhong jiang Cc: Ruan Shiyang , Jan Kara , linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, darrick.wong@oracle.com, dan.j.williams@intel.com, david@fromorbit.com, hch@lst.de, song@kernel.org, rgoldwyn@suse.de, qi.fuli@fujitsu.com, y-goto@fujitsu.com Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping Message-ID: <20210114172018.GZ1164246@magnolia> References: <20201230165601.845024-1-ruansy.fnst@cn.fujitsu.com> <20201230165601.845024-5-ruansy.fnst@cn.fujitsu.com> <20210106154132.GC29271@quack2.suse.cz> <75164044-bfdf-b2d6-dff0-d6a8d56d1f62@cn.fujitsu.com> <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com> <4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com> <53ecb7e2-8f59-d1a5-df75-4780620ce91f@linux.alibaba.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <53ecb7e2-8f59-d1a5-df75-4780620ce91f@linux.alibaba.com> Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Thu, Jan 14, 2021 at 05:38:33PM +0800, zhong jiang wrote: >=20 > On 2021/1/14 11:52 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: > >=20 > >=20 > > On 2021/1/14 =E4=B8=8A=E5=8D=8811:26, zhong jiang wrote: > > >=20 > > > On 2021/1/14 9:44 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: > > > >=20 > > > >=20 > > > > On 2021/1/13 =E4=B8=8B=E5=8D=886:04, zhong jiang wrote: > > > > >=20 > > > > > On 2021/1/12 10:55 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: > > > > > >=20 > > > > > >=20 > > > > > > On 2021/1/6 =E4=B8=8B=E5=8D=8811:41, Jan Kara wrote: > > > > > > > On Thu 31-12-20 00:55:55, Shiyang Ruan wrote: > > > > > > > > The current memory_failure_dev_pagemap() can > > > > > > > > only handle single-mapped > > > > > > > > dax page for fsdax mode.=C2=A0 The dax page could be > > > > > > > > mapped by multiple files > > > > > > > > and offsets if we let reflink feature & fsdax > > > > > > > > mode work together. So, > > > > > > > > we refactor current implementation to support > > > > > > > > handle memory failure on > > > > > > > > each file and offset. > > > > > > > >=20 > > > > > > > > Signed-off-by: Shiyang Ruan > > > > > > >=20 > > > > > > > Overall this looks OK to me, a few comments below. > > > > > > >=20 > > > > > > > > --- > > > > > > > > =C2=A0 fs/dax.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 | 21 +++++++++++ > > > > > > > > =C2=A0 include/linux/dax.h |=C2=A0 1 + > > > > > > > > =C2=A0 include/linux/mm.h=C2=A0 |=C2=A0 9 +++++ > > > > > > > > =C2=A0 mm/memory-failure.c | 91 > > > > > > > > ++++++++++++++++++++++++++++++++++----------- > > > > > > > > =C2=A0 4 files changed, 100 insertions(+), 22 deletions(-= ) > > > > > >=20 > > > > > > ... 
> > > > > > 
> > > > > > > > 
> > > > > > > > @@ -345,9 +348,12 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
> > > > > > > >  	}
> > > > > > > >  
> > > > > > > >  	tk->addr = page_address_in_vma(p, vma);
> > > > > > > > -	if (is_zone_device_page(p))
> > > > > > > > -		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
> > > > > > > > -	else
> > > > > > > > +	if (is_zone_device_page(p)) {
> > > > > > > > +		if (is_device_fsdax_page(p))
> > > > > > > > +			tk->addr = vma->vm_start +
> > > > > > > > +				((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > > > > > > 
> > > > > > > It seems strange to use 'pgoff' for dax pages and not for any
> > > > > > > other page.  Why?  I'd rather pass the correct pgoff from all
> > > > > > > callers of add_to_kill() and avoid this special casing...
> > > > > > 
> > > > > > Because one fsdax page can be shared by multiple pgoffs.  I have
> > > > > > to pass each pgoff in each iteration to calculate the address in
> > > > > > the vma (for tk->addr).  Other kinds of pages don't need this;
> > > > > > they can get their unique address by calling
> > > > > > "page_address_in_vma()".
> > > > > > 
> > > > > IMO, an fsdax page can be shared by multiple files rather than
> > > > > multiple pgoffs if the fs supports reflink, because a page is only
> > > > > located in one mapping (page->mapping is exclusive), hence it only
> > > > > has one pgoff or index pointing at the node.
> > > > > 
> > > > > Or am I missing something about the feature?  Thanks,
> > > > 
> > > > Yes, an fsdax page is shared by multiple files because of reflink.
> > > > I think my description of 'pgoff' here is not correct.  This 'pgoff'
> > > > means the offset within a file.  (We use rmap to find out all the
> > > > sharing files and their offsets.)  So I said it "can be shared by
> > > > multiple pgoffs".  It's my bad.
> > > > 
> > > > I think I should use another name for it to avoid misunderstandings.
> > > > 
> > > IMO, all the sharing files should use the same offset to share the
> > > fsdax page.  Why is that not the case?
> > 
> > The dedupe operation can let different files share the same data extent
> > even though their offsets are not the same.  So, files can share one
> > fsdax page at different offsets.
> Ok, got it.
> > 
> > > As you have said, a shared fsdax page should be inserted into the
> > > mapping of each file, but page->index and page->mapping are exclusive,
> > > hence a page should only be placed in one mapping tree.
> > 
> > We can't use page->mapping and page->index here for reflink & fsdax, and
> > that's what this patchset aims to solve.  I introduced a series of
> > ->corrupted_range() calls, from mm to the pmem driver to the block
> > device and finally to the filesystem, to use the filesystem's rmap
> > feature to find out all files sharing the same data extent (fsdax page).
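[A minimal user-space model of the lookup flow described above -- this is not the kernel implementation, and every name in it (corrupted_range, extent_owner, the paths and numbers) is hypothetical.  It only illustrates the idea that one poisoned physical extent maps back, via a filesystem-style reverse mapping, to several (file, offset) pairs that each need their own kill pass.]

```c
#include <stdio.h>

/*
 * Toy stand-in for a filesystem rmap record: one physical extent can be
 * owned by several files at different file offsets (reflink/dedupe).
 */
struct extent_owner {
	const char *file;      /* hypothetical file name */
	unsigned long pgoff;   /* page offset of the extent within that file */
};

/* A single shared extent at physical page 0x1200, owned twice. */
static const struct extent_owner owners[] = {
	{ "/mnt/pmem/a.img", 0 },   /* file A maps it at page offset 0 */
	{ "/mnt/pmem/b.img", 16 },  /* file B maps the same extent at page 16 */
};

/*
 * Model of the notification path: given a poisoned physical page, walk the
 * reverse mapping and report each (file, pgoff) that must be handled -- in
 * the kernel this is where processes mapping that file range would be killed.
 */
static void corrupted_range(unsigned long bad_pfn)
{
	for (unsigned int i = 0; i < sizeof(owners) / sizeof(owners[0]); i++)
		printf("pfn %#lx -> file %s, pgoff %lu: collect and kill mappers\n",
		       bad_pfn, owners[i].file, owners[i].pgoff);
}

int main(void)
{
	corrupted_range(0x1200);
	return 0;
}
```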
> 
> From this patch, each file has a mapping tree, and the shared page will
> be inserted into the mapping tree of multiple files.  The filesystem then
> uses file and offset to find the processes to kill.  Is that correct?

FWIW I thought the purpose of this patchset is to remove the (dax) memory
poison code's reliance on the pagecache mapping structure by pushing
poison notifications directly into the filesystem and letting the
filesystem perform reverse lookup operations to figure out which file(s)
have gone bad, and using the file list to call back into the mm to kill
processes.

Once that's done, I think(?) that puts us significantly closer to being
able to share pmem between files in dax mode without having to rewrite the
entire memory manager's mapping and rmapping code to support sharing.

--D

> Thanks,
> 
> > 
> > 
> > -- 
> > Thanks,
> > Ruan Shiyang.
> > 
> > > 
> > > And in the current patch, we fail to find out all the processes that
> > > use the fsdax page shared by multiple files and kill them.
> > > 
> > > 
> > > Thanks,
> > > 
> > > > -- 
> > > > Thanks,
> > > > Ruan Shiyang.
> > > > 
> > > > > 
> > > > > > So, I added this fsdax case here.  This patchset only implements
> > > > > > the fsdax case; other cases also need to be added here if they
> > > > > > are to be implemented.
> > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Thanks,
> > > > > > Ruan Shiyang.
> > > > > > 
> > > > > > > 
> > > > > > > > +		tk->size_shift = dev_pagemap_mapping_shift(p, vma, tk->addr);
> > > > > > > > +	} else
> > > > > > > >  		tk->size_shift = page_shift(compound_head(p));
> > > > > > > >  
> > > > > > > >  	/*
> > > > > > > > @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> > > > > > > >  			if (!page_mapped_in_vma(page, vma))
> > > > > > > >  				continue;
> > > > > > > >  			if (vma->vm_mm == t->mm)
> > > > > > > > -				add_to_kill(t, page, vma, to_kill);
> > > > > > > > +				add_to_kill(t, page, NULL, 0, vma, to_kill);
> > > > > > > >  		}
> > > > > > > >  	}
> > > > > > > >  	read_unlock(&tasklist_lock);
> > > > > > > > @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
> > > > > > > >  /*
> > > > > > > >   * Collect processes when the error hit a file mapped page.
> > > > > > > >   */
> > > > > > > > -static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > > > > > > > -				int force_early)
> > > > > > > > +static void collect_procs_file(struct page *page, struct address_space *mapping,
> > > > > > > > +		pgoff_t pgoff, struct list_head *to_kill, int force_early)
> > > > > > > >  {
> > > > > > > >  	struct vm_area_struct *vma;
> > > > > > > >  	struct task_struct *tsk;
> > > > > > > > -	struct address_space *mapping = page->mapping;
> > > > > > > > -	pgoff_t pgoff;
> > > > > > > >  
> > > > > > > >  	i_mmap_lock_read(mapping);
> > > > > > > >  	read_lock(&tasklist_lock);
> > > > > > > > -	pgoff = page_to_pgoff(page);
> > > > > > > >  	for_each_process(tsk) {
> > > > > > > >  		struct task_struct *t = task_early_kill(tsk, force_early);
> > > > > > > > -
> > > > > > > >  		if (!t)
> > > > > > > >  			continue;
> > > > > > > > -		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
> > > > > > > > -				      pgoff) {
> > > > > > > > +		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > > > > > >  			/*
> > > > > > > >  			 * Send early kill signal to tasks where a vma covers
> > > > > > > >  			 * the page but the corrupted page is not necessarily
> > > > > > > > @@ -531,7 +532,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > > > > > > >  			 * to be informed of all such data corruptions.
> > > > > > > >  			 */
> > > > > > > >  			if (vma->vm_mm == t->mm)
> > > > > > > > -				add_to_kill(t, page, vma, to_kill);
> > > > > > > > +				add_to_kill(t, page, mapping, pgoff, vma, to_kill);
> > > > > > > >  		}
> > > > > > > >  	}
> > > > > > > >  	read_unlock(&tasklist_lock);
> > > > > > > > @@ -550,7 +551,8 @@ static void collect_procs(struct page *page, struct list_head *tokill,
> > > > > > > >  	if (PageAnon(page))
> > > > > > > >  		collect_procs_anon(page, tokill, force_early);
> > > > > > > >  	else
> > > > > > > > -		collect_procs_file(page, tokill, force_early);
> > > > > > > > +		collect_procs_file(page, page->mapping, page_to_pgoff(page),
> > > > > > > 
> > > > > > > Why not use page_mapping() helper here? It would be safer for
> > > > > > > THPs if they ever get here...
> > > > > > > 
> > > > > > > 								Honza
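[To make the add_to_kill() change discussed upthread concrete: for an fsdax page the patch derives the user address from the file offset rather than from page_address_in_vma(), using the vma's start address and its starting file offset.  A minimal sketch of that arithmetic follows; the struct, helper name, and example values are made up for illustration.]

```c
#include <stdio.h>

#define PAGE_SHIFT 12

/* Minimal model of the fields add_to_kill() uses from the vma. */
struct vma_model {
	unsigned long vm_start;   /* user VA where the mapping begins */
	unsigned long vm_pgoff;   /* file page offset of that first VA */
};

/*
 * The calculation used for fsdax pages in the hunk above:
 * addr = vm_start + ((pgoff - vm_pgoff) << PAGE_SHIFT).
 */
static unsigned long addr_for_pgoff(const struct vma_model *vma,
				    unsigned long pgoff)
{
	return vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
}

int main(void)
{
	/*
	 * Made-up example: a mapping at 0x7f0000000000 that starts at file
	 * page 16; the poisoned extent sits at file page 20.
	 */
	struct vma_model vma = { .vm_start = 0x7f0000000000UL, .vm_pgoff = 16 };

	/* Prints 0x7f0000004000: four pages past vm_start. */
	printf("user address to signal: %#lx\n", addr_for_pgoff(&vma, 20));
	return 0;
}
```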