From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY, URIBL_BLOCKED,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1C5BDC433DB for ; Thu, 14 Jan 2021 03:26:48 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 66D1523719 for ; Thu, 14 Jan 2021 03:26:47 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 66D1523719 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.alibaba.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A0D608D00B6; Wed, 13 Jan 2021 22:26:46 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9959F8D008E; Wed, 13 Jan 2021 22:26:46 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 85CAB8D00B6; Wed, 13 Jan 2021 22:26:46 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0194.hostedemail.com [216.40.44.194]) by kanga.kvack.org (Postfix) with ESMTP id 6D36E8D008E for ; Wed, 13 Jan 2021 22:26:46 -0500 (EST) Received: from smtpin27.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 394F3181AEF31 for ; Thu, 14 Jan 2021 03:26:46 +0000 (UTC) X-FDA: 77702943612.27.swing28_5d05c8827523 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin27.hostedemail.com (Postfix) with ESMTP id 1FCB43D663 for ; Thu, 14 Jan 2021 03:26:46 +0000 (UTC) X-HE-Tag: swing28_5d05c8827523 X-Filterd-Recvd-Size: 9015 Received: from out30-44.freemail.mail.aliyun.com (out30-44.freemail.mail.aliyun.com [115.124.30.44]) by imf23.hostedemail.com (Postfix) with ESMTP for ; Thu, 14 Jan 2021 03:26:43 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R201e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04426;MF=zhongjiang-ali@linux.alibaba.com;NM=1;PH=DS;RN=16;SR=0;TI=SMTPD_---0ULg9Uqv_1610594791; Received: from L-X1DSLVDL-1420.local(mailfrom:zhongjiang-ali@linux.alibaba.com fp:SMTPD_---0ULg9Uqv_1610594791) by smtp.aliyun-inc.com(127.0.0.1); Thu, 14 Jan 2021 11:26:32 +0800 Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping To: Ruan Shiyang , Jan Kara Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-nvdimm@lists.01.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-raid@vger.kernel.org, darrick.wong@oracle.com, dan.j.williams@intel.com, david@fromorbit.com, hch@lst.de, song@kernel.org, rgoldwyn@suse.de, qi.fuli@fujitsu.com, y-goto@fujitsu.com References: <20201230165601.845024-1-ruansy.fnst@cn.fujitsu.com> <20201230165601.845024-5-ruansy.fnst@cn.fujitsu.com> <20210106154132.GC29271@quack2.suse.cz> <75164044-bfdf-b2d6-dff0-d6a8d56d1f62@cn.fujitsu.com> <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com> From: zhong jiang Message-ID: Date: Thu, 14 Jan 2021 11:26:31 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:85.0) Gecko/20100101 Thunderbird/85.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2021/1/14 9:44 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: > > > On 2021/1/13 =E4=B8=8B=E5=8D=886:04, zhong jiang wrote: >> >> On 2021/1/12 10:55 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: >>> >>> >>> On 2021/1/6 =E4=B8=8B=E5=8D=8811:41, Jan Kara wrote: >>>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote: >>>>> The current memory_failure_dev_pagemap() can only handle=20 >>>>> single-mapped >>>>> dax page for fsdax mode.=C2=A0 The dax page could be mapped by mult= iple=20 >>>>> files >>>>> and offsets if we let reflink feature & fsdax mode work together.=C2= =A0=20 >>>>> So, >>>>> we refactor current implementation to support handle memory=20 >>>>> failure on >>>>> each file and offset. >>>>> >>>>> Signed-off-by: Shiyang Ruan >>>> >>>> Overall this looks OK to me, a few comments below. >>>> >>>>> --- >>>>> =C2=A0 fs/dax.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 | 21 +++++++++++ >>>>> =C2=A0 include/linux/dax.h |=C2=A0 1 + >>>>> =C2=A0 include/linux/mm.h=C2=A0 |=C2=A0 9 +++++ >>>>> =C2=A0 mm/memory-failure.c | 91=20 >>>>> ++++++++++++++++++++++++++++++++++----------- >>>>> =C2=A0 4 files changed, 100 insertions(+), 22 deletions(-) >>> >>> ... >>> >>>>> =C2=A0 @@ -345,9 +348,12 @@ static void add_to_kill(struct task_str= uct=20 >>>>> *tsk, struct page *p, >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->addr =3D page_address_in_= vma(p, vma); >>>>> -=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D dev_= pagemap_mapping_shift(p, vma); >>>>> -=C2=A0=C2=A0=C2=A0 else >>>>> +=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) { >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (is_device_fsdax_pag= e(p)) >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= tk->addr =3D vma->vm_start + >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ((pgoff - vma->vm_pgoff)= << PAGE_SHIFT); >>>> >>>> It seems strange to use 'pgoff' for dax pages and not for any other=20 >>>> page. >>>> Why? I'd rather pass correct pgoff from all callers of=20 >>>> add_to_kill() and >>>> avoid this special casing... >>> >>> Because one fsdax page can be shared by multiple pgoffs.=C2=A0 I have= to=20 >>> pass each pgoff in each iteration to calculate the address in vma=20 >>> (for tk->addr).=C2=A0 Other kinds of pages don't need this. They can = get=20 >>> their unique address by calling "page_address_in_vma()". >>> >> IMO,=C2=A0=C2=A0 an fsdax page can be shared by multiple files rather = than=20 >> multiple pgoffs if fs query support reflink.=C2=A0=C2=A0 Because an pa= ge only=20 >> located in an mapping(page->mapping is exclusive), hence it=C2=A0 only= has=20 >> an pgoff or index pointing at the node. >> >> =C2=A0=C2=A0or=C2=A0 I miss something for the feature ?=C2=A0 thanks, > > Yes, a fsdax page is shared by multiple files because of reflink. I=20 > think my description of 'pgoff' here is not correct.=C2=A0 This 'pgoff'= =20 > means the offset within the a file.=C2=A0 (We use rmap to find out all = the=20 > sharing files and their offsets.)=C2=A0 So, I said that "can be shared = by=20 > multiple pgoffs".=C2=A0 It's my bad. > > I think I should name it another word to avoid misunderstandings. > IMO,=C2=A0 All the sharing files should be the same offset to share the f= sdax=20 page.=C2=A0 why not that ?=C2=A0 As you has said,=C2=A0 a shared fadax pa= ge should be=20 inserted to different mapping files.=C2=A0 but page->index and page->mapp= ing=20 is exclusive.=C2=A0 hence an page only should be placed in an mapping tre= e. And In the current patch,=C2=A0 we failed to found out that all process u= se=20 the fsdax page shared by multiple files and kill them. Thanks, > --=20 > Thanks, > Ruan Shiyang. > >> >>> So, I added this fsdax case here.=C2=A0 This patchset only implemente= d=20 >>> the fsdax case, other cases also need to be added here if to be=20 >>> implemented. >>> >>> >>> --=20 >>> Thanks, >>> Ruan Shiyang. >>> >>>> >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D dev_= pagemap_mapping_shift(p, vma,=20 >>>>> tk->addr); >>>>> +=C2=A0=C2=A0=C2=A0 } else >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shi= ft =3D page_shift(compound_head(p)); >>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page=20 >>>>> *page, struct list_head *to_kill, >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 if (!page_mapped_in_vma(page, vma)) >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 continue; >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, NULL, 0, vma, to_kill); >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page=20 >>>>> *page, struct list_head *to_kill, >>>>> =C2=A0 /* >>>>> =C2=A0=C2=A0 * Collect processes when the error hit a file mapped p= age. >>>>> =C2=A0=C2=A0 */ >>>>> -static void collect_procs_file(struct page *page, struct=20 >>>>> list_head *to_kill, >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 int force_early) >>>>> +static void collect_procs_file(struct page *page, struct=20 >>>>> address_space *mapping, >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff_t pgoff, struct l= ist_head *to_kill, int force_early) >>>>> =C2=A0 { >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task_struct *tsk; >>>>> -=C2=A0=C2=A0=C2=A0 struct address_space *mapping =3D page->mapping= ; >>>>> -=C2=A0=C2=A0=C2=A0 pgoff_t pgoff; >>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 i_mmap_lock_read(mapping); >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_lock(&tasklist_lock); >>>>> -=C2=A0=C2=A0=C2=A0 pgoff =3D page_to_pgoff(page); >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for_each_process(tsk) { >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task_= struct *t =3D task_early_kill(tsk, force_early); >>>>> - >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!t) >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 continue; >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_forea= ch(vma, &mapping->i_mmap, pgoff, >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff) { >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_forea= ch(vma, &mapping->i_mmap, pgoff,=20 >>>>> pgoff) { >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 /* >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 * Send early kill signal to tasks where a vma covers >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 * the page but the corrupted page is not necessarily >>>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page=20 >>>>> *page, struct list_head *to_kill, >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 * to be informed of all such data corruptions. >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 */ >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, mapping, pgoff, vma, to_kil= l); >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page,=20 >>>>> struct list_head *tokill, >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (PageAnon(page)) >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_proc= s_anon(page, tokill, force_early); >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 else >>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(page= , tokill, force_early); >>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(page= , page->mapping, page_to_pgoff(page), >>>> >>>> Why not use page_mapping() helper here? It would be safer for THPs=20 >>>> if they >>>> ever get here... >>>> >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Honza >>>> >>> >> >> >