From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 56F0DC433DB for ; Thu, 14 Jan 2021 03:52:24 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 9DF952389E for ; Thu, 14 Jan 2021 03:52:23 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9DF952389E Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=cn.fujitsu.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id EA1FF8D00B9; Wed, 13 Jan 2021 22:52:22 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E521D8D008E; Wed, 13 Jan 2021 22:52:22 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D68B58D00B9; Wed, 13 Jan 2021 22:52:22 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0212.hostedemail.com [216.40.44.212]) by kanga.kvack.org (Postfix) with ESMTP id C13B18D008E for ; Wed, 13 Jan 2021 22:52:22 -0500 (EST) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 89F011EFF for ; Thu, 14 Jan 2021 03:52:22 +0000 (UTC) X-FDA: 77703008124.06.color30_250d8a227523 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin06.hostedemail.com (Postfix) with ESMTP id 779A410043E5F for ; Thu, 14 Jan 2021 03:52:22 +0000 (UTC) X-HE-Tag: color30_250d8a227523 X-Filterd-Recvd-Size: 10316 Received: from heian.cn.fujitsu.com (mail.cn.fujitsu.com [183.91.158.132]) by imf34.hostedemail.com (Postfix) with ESMTP for ; Thu, 14 Jan 2021 03:52:20 +0000 (UTC) X-IronPort-AV: E=Sophos;i="5.79,346,1602518400"; d="scan'208";a="103466040" Received: from unknown (HELO cn.fujitsu.com) ([10.167.33.5]) by heian.cn.fujitsu.com with ESMTP; 14 Jan 2021 11:52:18 +0800 Received: from G08CNEXMBPEKD05.g08.fujitsu.local (unknown [10.167.33.204]) by cn.fujitsu.com (Postfix) with ESMTP id 929244CE602D; Thu, 14 Jan 2021 11:52:16 +0800 (CST) Received: from irides.mr (10.167.225.141) by G08CNEXMBPEKD05.g08.fujitsu.local (10.167.33.204) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Thu, 14 Jan 2021 11:52:17 +0800 Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping To: zhong jiang , Jan Kara CC: , , , , , , , , , , , , , References: <20201230165601.845024-1-ruansy.fnst@cn.fujitsu.com> <20201230165601.845024-5-ruansy.fnst@cn.fujitsu.com> <20210106154132.GC29271@quack2.suse.cz> <75164044-bfdf-b2d6-dff0-d6a8d56d1f62@cn.fujitsu.com> <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com> From: Ruan Shiyang Message-ID: <4f184987-3cc2-c72d-0774-5d20ea2e1d49@cn.fujitsu.com> Date: Thu, 14 Jan 2021 11:52:14 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Content-Language: en-US X-Originating-IP: [10.167.225.141] X-ClientProxiedBy: G08CNEXCHPEKD04.g08.fujitsu.local (10.167.33.200) To G08CNEXMBPEKD05.g08.fujitsu.local (10.167.33.204) X-yoursite-MailScanner-ID: 929244CE602D.ADB7E X-yoursite-MailScanner: Found to be clean X-yoursite-MailScanner-From: ruansy.fnst@cn.fujitsu.com Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2021/1/14 =E4=B8=8A=E5=8D=8811:26, zhong jiang wrote: >=20 > On 2021/1/14 9:44 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: >> >> >> On 2021/1/13 =E4=B8=8B=E5=8D=886:04, zhong jiang wrote: >>> >>> On 2021/1/12 10:55 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: >>>> >>>> >>>> On 2021/1/6 =E4=B8=8B=E5=8D=8811:41, Jan Kara wrote: >>>>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote: >>>>>> The current memory_failure_dev_pagemap() can only handle=20 >>>>>> single-mapped >>>>>> dax page for fsdax mode.=C2=A0 The dax page could be mapped by mul= tiple=20 >>>>>> files >>>>>> and offsets if we let reflink feature & fsdax mode work together. = So, >>>>>> we refactor current implementation to support handle memory=20 >>>>>> failure on >>>>>> each file and offset. >>>>>> >>>>>> Signed-off-by: Shiyang Ruan >>>>> >>>>> Overall this looks OK to me, a few comments below. >>>>> >>>>>> --- >>>>>> =C2=A0 fs/dax.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 | 21 +++++++++++ >>>>>> =C2=A0 include/linux/dax.h |=C2=A0 1 + >>>>>> =C2=A0 include/linux/mm.h=C2=A0 |=C2=A0 9 +++++ >>>>>> =C2=A0 mm/memory-failure.c | 91=20 >>>>>> ++++++++++++++++++++++++++++++++++----------- >>>>>> =C2=A0 4 files changed, 100 insertions(+), 22 deletions(-) >>>> >>>> ... >>>> >>>>>> =C2=A0 @@ -345,9 +348,12 @@ static void add_to_kill(struct task_st= ruct=20 >>>>>> *tsk, struct page *p, >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->addr =3D page_address_in= _vma(p, vma); >>>>>> -=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D dev= _pagemap_mapping_shift(p, vma); >>>>>> -=C2=A0=C2=A0=C2=A0 else >>>>>> +=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) { >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (is_device_fsdax_pa= ge(p)) >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= tk->addr =3D vma->vm_start + >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ((pgoff - vma->vm_pgoff)= << PAGE_SHIFT); >>>>> >>>>> It seems strange to use 'pgoff' for dax pages and not for any other= =20 >>>>> page. >>>>> Why? I'd rather pass correct pgoff from all callers of=20 >>>>> add_to_kill() and >>>>> avoid this special casing... >>>> >>>> Because one fsdax page can be shared by multiple pgoffs.=C2=A0 I hav= e to=20 >>>> pass each pgoff in each iteration to calculate the address in vma=20 >>>> (for tk->addr).=C2=A0 Other kinds of pages don't need this. They can= get=20 >>>> their unique address by calling "page_address_in_vma()". >>>> >>> IMO,=C2=A0=C2=A0 an fsdax page can be shared by multiple files rather= than=20 >>> multiple pgoffs if fs query support reflink.=C2=A0=C2=A0 Because an p= age only=20 >>> located in an mapping(page->mapping is exclusive), hence it=C2=A0 onl= y has=20 >>> an pgoff or index pointing at the node. >>> >>> =C2=A0=C2=A0or=C2=A0 I miss something for the feature ?=C2=A0 thanks, >> >> Yes, a fsdax page is shared by multiple files because of reflink. I=20 >> think my description of 'pgoff' here is not correct.=C2=A0 This 'pgoff= '=20 >> means the offset within the a file.=C2=A0 (We use rmap to find out all= the=20 >> sharing files and their offsets.)=C2=A0 So, I said that "can be shared= by=20 >> multiple pgoffs".=C2=A0 It's my bad. >> >> I think I should name it another word to avoid misunderstandings. >> > IMO,=C2=A0 All the sharing files should be the same offset to share the= fsdax=20 > page.=C2=A0 why not that ?=20 The dedupe operation can let different files share their same data=20 extent, though offsets are not same. So, files can share one fsdax page=20 at different offset. > As you has said,=C2=A0 a shared fadax page should be=20 > inserted to different mapping files.=C2=A0 but page->index and page->ma= pping=20 > is exclusive.=C2=A0 hence an page only should be placed in an mapping t= ree. We can't use page->mapping and page->index here for reflink & fsdax.=20 And that's this patchset aims to solve. I introduced a series of=20 ->corrupted_range(), from mm to pmem driver to block device and finally=20 to filesystem, to use rmap feature of filesystem to find out all files=20 sharing same data extent (fsdax page). -- Thanks, Ruan Shiyang. >=20 > And In the current patch,=C2=A0 we failed to found out that all process= use=20 > the fsdax page shared by multiple files and kill them. >=20 >=20 > Thanks, >=20 >> --=20 >> Thanks, >> Ruan Shiyang. >> >>> >>>> So, I added this fsdax case here.=C2=A0 This patchset only implement= ed=20 >>>> the fsdax case, other cases also need to be added here if to be=20 >>>> implemented. >>>> >>>> >>>> --=20 >>>> Thanks, >>>> Ruan Shiyang. >>>> >>>>> >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D dev= _pagemap_mapping_shift(p, vma,=20 >>>>>> tk->addr); >>>>>> +=C2=A0=C2=A0=C2=A0 } else >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_sh= ift =3D page_shift(compound_head(p)); >>>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page=20 >>>>>> *page, struct list_head *to_kill, >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 if (!page_mapped_in_vma(page, vma)) >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 continue; >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, NULL, 0, vma, to_kill); >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page=20 >>>>>> *page, struct list_head *to_kill, >>>>>> =C2=A0 /* >>>>>> =C2=A0=C2=A0 * Collect processes when the error hit a file mapped = page. >>>>>> =C2=A0=C2=A0 */ >>>>>> -static void collect_procs_file(struct page *page, struct=20 >>>>>> list_head *to_kill, >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 int force_early) >>>>>> +static void collect_procs_file(struct page *page, struct=20 >>>>>> address_space *mapping, >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff_t pgoff, struct = list_head *to_kill, int force_early) >>>>>> =C2=A0 { >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task_struct *tsk; >>>>>> -=C2=A0=C2=A0=C2=A0 struct address_space *mapping =3D page->mappin= g; >>>>>> -=C2=A0=C2=A0=C2=A0 pgoff_t pgoff; >>>>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 i_mmap_lock_read(mapping); >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_lock(&tasklist_lock); >>>>>> -=C2=A0=C2=A0=C2=A0 pgoff =3D page_to_pgoff(page); >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for_each_process(tsk) { >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task= _struct *t =3D task_early_kill(tsk, force_early); >>>>>> - >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!t) >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 continue; >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_fore= ach(vma, &mapping->i_mmap, pgoff, >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff) { >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_fore= ach(vma, &mapping->i_mmap, pgoff,=20 >>>>>> pgoff) { >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 /* >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 * Send early kill signal to tasks where a vma covers >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 * the page but the corrupted page is not necessarily >>>>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page=20 >>>>>> *page, struct list_head *to_kill, >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 * to be informed of all such data corruptions. >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0 */ >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, mapping, pgoff, vma, to_kil= l); >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page,=20 >>>>>> struct list_head *tokill, >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (PageAnon(page)) >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_pro= cs_anon(page, tokill, force_early); >>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 else >>>>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(pag= e, tokill, force_early); >>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(pag= e, page->mapping, page_to_pgoff(page), >>>>> >>>>> Why not use page_mapping() helper here? It would be safer for THPs=20 >>>>> if they >>>>> ever get here... >>>>> >>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Honza >>>>> >>>> >>> >>> >> >=20 >=20