From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DC2E4C433E0 for ; Thu, 14 Jan 2021 01:44:30 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 460C72343B for ; Thu, 14 Jan 2021 01:44:30 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 460C72343B Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=cn.fujitsu.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 76F676B0178; Wed, 13 Jan 2021 20:44:29 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 71FCA6B017A; Wed, 13 Jan 2021 20:44:29 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 610108D008E; Wed, 13 Jan 2021 20:44:29 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0178.hostedemail.com [216.40.44.178]) by kanga.kvack.org (Postfix) with ESMTP id 4B2A56B0178 for ; Wed, 13 Jan 2021 20:44:29 -0500 (EST) Received: from smtpin17.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 0705F824556B for ; Thu, 14 Jan 2021 01:44:29 +0000 (UTC) X-FDA: 77702685858.17.error25_050584327522 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin17.hostedemail.com (Postfix) with ESMTP id DF86D180D0185 for ; Thu, 14 Jan 2021 01:44:28 +0000 (UTC) X-HE-Tag: error25_050584327522 X-Filterd-Recvd-Size: 8731 Received: from heian.cn.fujitsu.com (mail.cn.fujitsu.com [183.91.158.132]) by imf03.hostedemail.com (Postfix) with ESMTP for ; Thu, 14 Jan 2021 01:44:26 +0000 (UTC) X-IronPort-AV: E=Sophos;i="5.79,345,1602518400"; d="scan'208";a="103460770" Received: from unknown (HELO cn.fujitsu.com) ([10.167.33.5]) by heian.cn.fujitsu.com with ESMTP; 14 Jan 2021 09:44:21 +0800 Received: from G08CNEXMBPEKD05.g08.fujitsu.local (unknown [10.167.33.204]) by cn.fujitsu.com (Postfix) with ESMTP id 6AB404CE1A08; Thu, 14 Jan 2021 09:44:16 +0800 (CST) Received: from irides.mr (10.167.225.141) by G08CNEXMBPEKD05.g08.fujitsu.local (10.167.33.204) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Thu, 14 Jan 2021 09:44:17 +0800 Subject: Re: [PATCH 04/10] mm, fsdax: Refactor memory-failure handler for dax mapping To: zhong jiang , Jan Kara CC: , , , , , , , , , , , , , References: <20201230165601.845024-1-ruansy.fnst@cn.fujitsu.com> <20201230165601.845024-5-ruansy.fnst@cn.fujitsu.com> <20210106154132.GC29271@quack2.suse.cz> <75164044-bfdf-b2d6-dff0-d6a8d56d1f62@cn.fujitsu.com> <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com> From: Ruan Shiyang Message-ID: Date: Thu, 14 Jan 2021 09:44:14 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.1 MIME-Version: 1.0 In-Reply-To: <781f276b-afdd-091c-3dba-048e415431ab@linux.alibaba.com> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Language: en-US X-Originating-IP: [10.167.225.141] X-ClientProxiedBy: G08CNEXCHPEKD04.g08.fujitsu.local (10.167.33.200) To G08CNEXMBPEKD05.g08.fujitsu.local (10.167.33.204) X-yoursite-MailScanner-ID: 6AB404CE1A08.AD0C5 X-yoursite-MailScanner: Found to be clean X-yoursite-MailScanner-From: ruansy.fnst@cn.fujitsu.com Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2021/1/13 =E4=B8=8B=E5=8D=886:04, zhong jiang wrote: >=20 > On 2021/1/12 10:55 =E4=B8=8A=E5=8D=88, Ruan Shiyang wrote: >> >> >> On 2021/1/6 =E4=B8=8B=E5=8D=8811:41, Jan Kara wrote: >>> On Thu 31-12-20 00:55:55, Shiyang Ruan wrote: >>>> The current memory_failure_dev_pagemap() can only handle single-mapp= ed >>>> dax page for fsdax mode.=C2=A0 The dax page could be mapped by multi= ple=20 >>>> files >>>> and offsets if we let reflink feature & fsdax mode work together.=C2= =A0 So, >>>> we refactor current implementation to support handle memory failure = on >>>> each file and offset. >>>> >>>> Signed-off-by: Shiyang Ruan >>> >>> Overall this looks OK to me, a few comments below. >>> >>>> --- >>>> =C2=A0 fs/dax.c=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 | 21 +++++++++++ >>>> =C2=A0 include/linux/dax.h |=C2=A0 1 + >>>> =C2=A0 include/linux/mm.h=C2=A0 |=C2=A0 9 +++++ >>>> =C2=A0 mm/memory-failure.c | 91=20 >>>> ++++++++++++++++++++++++++++++++++----------- >>>> =C2=A0 4 files changed, 100 insertions(+), 22 deletions(-) >> >> ... >> >>>> =C2=A0 @@ -345,9 +348,12 @@ static void add_to_kill(struct task_stru= ct=20 >>>> *tsk, struct page *p, >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->addr =3D page_address_in_v= ma(p, vma); >>>> -=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D dev_p= agemap_mapping_shift(p, vma); >>>> -=C2=A0=C2=A0=C2=A0 else >>>> +=C2=A0=C2=A0=C2=A0 if (is_zone_device_page(p)) { >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (is_device_fsdax_page= (p)) >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = tk->addr =3D vma->vm_start + >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ((pgoff - vma->vm_pgoff) <<= PAGE_SHIFT); >>> >>> It seems strange to use 'pgoff' for dax pages and not for any other=20 >>> page. >>> Why? I'd rather pass correct pgoff from all callers of add_to_kill() = and >>> avoid this special casing... >> >> Because one fsdax page can be shared by multiple pgoffs.=C2=A0 I have = to=20 >> pass each pgoff in each iteration to calculate the address in vma (for= =20 >> tk->addr).=C2=A0 Other kinds of pages don't need this. They can get th= eir=20 >> unique address by calling "page_address_in_vma()". >> > IMO,=C2=A0=C2=A0 an fsdax page can be shared by multiple files rather t= han=20 > multiple pgoffs if fs query support reflink.=C2=A0=C2=A0 Because an pag= e only=20 > located in an mapping(page->mapping is exclusive),=C2=A0 hence it=C2=A0= only has=20 > an pgoff or index pointing at the node. >=20 > =C2=A0or=C2=A0 I miss something for the feature ?=C2=A0 thanks, Yes, a fsdax page is shared by multiple files because of reflink. I=20 think my description of 'pgoff' here is not correct. This 'pgoff' means=20 the offset within the a file. (We use rmap to find out all the sharing=20 files and their offsets.) So, I said that "can be shared by multiple=20 pgoffs". It's my bad. I think I should name it another word to avoid misunderstandings. -- Thanks, Ruan Shiyang. >=20 >> So, I added this fsdax case here.=C2=A0 This patchset only implemented= the=20 >> fsdax case, other cases also need to be added here if to be implemente= d. >> >> >> --=20 >> Thanks, >> Ruan Shiyang. >> >>> >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shift =3D dev_p= agemap_mapping_shift(p, vma, tk->addr); >>>> +=C2=A0=C2=A0=C2=A0 } else >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 tk->size_shif= t =3D page_shift(compound_head(p)); >>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 /* >>>> @@ -495,7 +501,7 @@ static void collect_procs_anon(struct page=20 >>>> *page, struct list_head *to_kill, >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 if (!page_mapped_in_vma(page, vma)) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 continue; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, NULL, 0, vma, to_kill); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>> @@ -505,24 +511,19 @@ static void collect_procs_anon(struct page=20 >>>> *page, struct list_head *to_kill, >>>> =C2=A0 /* >>>> =C2=A0=C2=A0 * Collect processes when the error hit a file mapped pa= ge. >>>> =C2=A0=C2=A0 */ >>>> -static void collect_procs_file(struct page *page, struct list_head=20 >>>> *to_kill, >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 int force_early) >>>> +static void collect_procs_file(struct page *page, struct=20 >>>> address_space *mapping, >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff_t pgoff, struct li= st_head *to_kill, int force_early) >>>> =C2=A0 { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct vm_area_struct *vma; >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task_struct *tsk; >>>> -=C2=A0=C2=A0=C2=A0 struct address_space *mapping =3D page->mapping; >>>> -=C2=A0=C2=A0=C2=A0 pgoff_t pgoff; >>>> =C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 i_mmap_lock_read(mapping); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_lock(&tasklist_lock); >>>> -=C2=A0=C2=A0=C2=A0 pgoff =3D page_to_pgoff(page); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 for_each_process(tsk) { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 struct task_s= truct *t =3D task_early_kill(tsk, force_early); >>>> - >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (!t) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 continue; >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_foreac= h(vma, &mapping->i_mmap, pgoff, >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pgoff) { >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 vma_interval_tree_foreac= h(vma, &mapping->i_mmap, pgoff,=20 >>>> pgoff) { >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 /* >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 * Send early kill signal to tasks where a vma covers >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 * the page but the corrupted page is not necessarily >>>> @@ -531,7 +532,7 @@ static void collect_procs_file(struct page=20 >>>> *page, struct list_head *to_kill, >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 * to be informed of all such data corruptions. >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 */ >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0 if (vma->vm_mm =3D=3D t->mm) >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, vma, to_kill); >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 add_to_kill(t, page, mapping, pgoff, vma, to_kill); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 read_unlock(&tasklist_lock); >>>> @@ -550,7 +551,8 @@ static void collect_procs(struct page *page,=20 >>>> struct list_head *tokill, >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (PageAnon(page)) >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs= _anon(page, tokill, force_early); >>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 else >>>> -=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(page,= tokill, force_early); >>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 collect_procs_file(page,= page->mapping, page_to_pgoff(page), >>> >>> Why not use page_mapping() helper here? It would be safer for THPs if= =20 >>> they >>> ever get here... >>> >>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 Honza >>> >> >=20 >=20