Subject: Re: [RFC PATCH 1/8] fs: introduce get_shared_files() for dax&reflink
From: Ruan Shiyang
To: "Darrick J. Wong"
Date: Mon, 10 Aug 2020 16:07:08 +0800
Message-ID: <5883fb44-4fdb-1068-b8fd-4ceab391c2a6@cn.fujitsu.com>
In-Reply-To: <20200807161526.GD6090@magnolia>
References: <20200807131336.318774-1-ruansy.fnst@cn.fujitsu.com> <20200807131336.318774-2-ruansy.fnst@cn.fujitsu.com> <20200807161526.GD6090@magnolia>
On 2020/8/8 12:15 AM, Darrick J. Wong wrote:
> On Fri, Aug 07, 2020 at 09:13:29PM +0800, Shiyang Ruan wrote:
>> Under the mode of both dax and reflink on, one page may be shared by
>> multiple files and offsets.  In order to track them in memory-failure or
>> other cases, we introduce this function by finding out who is sharing
>> this block (the page) in a filesystem.  It returns a list that contains
>> all the owners, and the offset in each owner.
>>
>> For XFS, rmapbt is used to find out the owners of one block.  So, it
>> should be turned on when we want to use the dax&reflink features together.
>>
>> Signed-off-by: Shiyang Ruan
>> ---
>>  fs/xfs/xfs_super.c  | 67 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/dax.h |  7 +++++
>>  include/linux/fs.h  |  2 ++
>>  3 files changed, 76 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 379cbff438bc..b71392219c91 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -35,6 +35,9 @@
>>  #include "xfs_refcount_item.h"
>>  #include "xfs_bmap_item.h"
>>  #include "xfs_reflink.h"
>> +#include "xfs_alloc.h"
>> +#include "xfs_rmap.h"
>> +#include "xfs_rmap_btree.h"
>>
>>  #include
>>  #include
>> @@ -1097,6 +1100,69 @@ xfs_fs_free_cached_objects(
>>  	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>>  }
>>
>> +static int _get_shared_files_fn(
>
> Needs an xfs_ prefix...
>
>> +	struct xfs_btree_cur	*cur,
>> +	struct xfs_rmap_irec	*rec,
>> +	void			*priv)
>> +{
>> +	struct list_head	*list = priv;
>> +	struct xfs_inode	*ip;
>> +	struct shared_files	*sfp;
>> +
>> +	/* Get files that are incore, filter out others that are not in use. */
>> +	xfs_iget(cur->bc_mp, cur->bc_tp, rec->rm_owner, XFS_IGET_INCORE, 0, &ip);
>
> No error checking at all?
>
> What if rm_owner refers to metadata?

Yes, we need to check whether the page contains metadata.  I remembered
that.
I wrote the check code but removed it in this patch, because I didn't
find a way to associate this block device with the dax page that
contains metadata.  We can call dax_associate_entry() to create the
association if the page's owner is a file, but that doesn't work for
metadata.  I should have explained that here.

>
>> +	if (ip && !ip->i_vnode.i_mapping)
>> +		return 0;
>
> When is the xfs_inode released?  We don't iput it here, and there's no
> way for dax_unlock_page (afaict the only consumer) to do it, so we
> leak the reference.

Do you mean xfs_irele()?  Sorry, I didn't realize that.  All we need is
to get the ->mapping from a given inode number.  So, I think we can call
xfs_irele() when exiting this function.

>
>> +
>> +	sfp = kmalloc(sizeof(*sfp), GFP_KERNEL);
>
> If there are millions of open files reflinked to this range of pmem this
> is going to allocate a /lot/ of memory.
>
>> +	sfp->mapping = ip->i_vnode.i_mapping;
>
> sfp->mapping = VFS_I(ip)->i_mapping;
>
>> +	sfp->index = rec->rm_offset;
>> +	list_add_tail(&sfp->list, list);
>
> Why do we leave ->cookie uninitialized?  What does it even do?

It's my fault.  ->cookie should have been added in the next patch.  It
stores each owner's dax entry when calling dax_lock_page() in
memory-failure.

>
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +xfs_fs_get_shared_files(
>> +	struct super_block	*sb,
>> +	pgoff_t			offset,
>
> Which device does this offset refer to?  XFS supports multiple storage
> devices.
>
> Also, uh, is this really a pgoff_t?  If yes, you can't use it with
> XFS_B_TO_FSB below without first converting it to a loff_t.

The offset here is assigned from iomap->addr, which is obtained from
iomap_begin(), so that we can easily look up the owners of this offset.

I don't quite understand what you said about supporting multiple storage
devices.  What are these devices?  Do you mean NVDIMM, HDD, SSD and
others?
>
>> +	struct list_head	*list)
>> +{
>> +	struct xfs_mount	*mp = XFS_M(sb);
>> +	struct xfs_trans	*tp = NULL;
>> +	struct xfs_btree_cur	*cur = NULL;
>> +	struct xfs_rmap_irec	rmap_low = { 0 }, rmap_high = { 0 };
>
> No need to memset(0) rmap_low later, or zero rmap_high just to memset it
> later.
>
>> +	struct xfs_buf		*agf_bp = NULL;
>> +	xfs_agblock_t		bno = XFS_B_TO_FSB(mp, offset);
>
> "FSB" refers to xfs_fsblock_t.  You just ripped the upper 32 bits off
> the fsblock number.

I think I misused these types.  Sorry for that.

>
>> +	xfs_agnumber_t		agno = XFS_FSB_TO_AGNO(mp, bno);
>> +	int			error = 0;
>> +
>> +	error = xfs_trans_alloc_empty(mp, &tp);
>> +	if (error)
>> +		return error;
>> +
>> +	error = xfs_alloc_read_agf(mp, tp, agno, 0, &agf_bp);
>> +	if (error)
>> +		return error;
>> +
>> +	cur = xfs_rmapbt_init_cursor(mp, tp, agf_bp, agno);
>> +
>> +	memset(&cur->bc_rec, 0, sizeof(cur->bc_rec));
>
> Not necessary, bc_rec is zero in a freshly created cursor.
>
>> +	/* Construct the range for one rmap search */
>> +	memset(&rmap_low, 0, sizeof(rmap_low));
>> +	memset(&rmap_high, 0xFF, sizeof(rmap_high));
>> +	rmap_low.rm_startblock = rmap_high.rm_startblock = bno;
>> +
>> +	error = xfs_rmap_query_range(cur, &rmap_low, &rmap_high,
>> +			_get_shared_files_fn, list);
>> +	if (error == -ECANCELED)
>> +		error = 0;
>> +
>> +	xfs_btree_del_cursor(cur, error);
>> +	xfs_trans_brelse(tp, agf_bp);
>> +	return error;
>> +}
>
> Looking at this, I don't think this is the right way to approach memory
> poisoning.  Rather than allocating a (potentially huge) linked list and
> passing it to the memory poison code to unmap pages, kill processes, and
> free the list, I think:
>
> 1) "->get_shared_files" should be more targetted.  Call it ->storage_lost
> or something, so that it only has one purpose, which is to react to
> asynchronous notifications that storage has been lost.

Yes, it's better.
I only considered file tracking.

>
> 2) The inner _get_shared_files_fn should directly call back into the
> memory manager to remove a poisoned page from the mapping and signal
> whatever process might have it mapped.

The error handling part is really reasonable.  But we have to call
dax_lock_page() at the beginning of memory-failure, which also needs to
iterate all the owners in order to lock their dax entries.  I think the
lock should be taken at the beginning, not in each iteration.

>
> That way, _get_shared_files_fn can look in the xfs buffer cache to see
> if we have a copy in DRAM, and immediately write it back to pmem.

I didn't think of fixing the storage device.  I will take it into
consideration.

>
> Hmm and now that you've gotten me rambling about hwpoison, I wonder what
> happens if dram backing part of the xfs buffer cache goes bad...

Yes, there are so many possible situations to consider.  For the current
stage, we just shut down the filesystem if a memory failure hits
metadata, and kill user processes if a failure hits normal files.  Is
that OK?

Anyway, thanks for reviewing.

--
Thanks,
Ruan Shiyang.
>
> --D
>
>> +
>>  static const struct super_operations xfs_super_operations = {
>>  	.alloc_inode		= xfs_fs_alloc_inode,
>>  	.destroy_inode		= xfs_fs_destroy_inode,
>> @@ -1110,6 +1176,7 @@ static const struct super_operations xfs_super_operations = {
>>  	.show_options		= xfs_fs_show_options,
>>  	.nr_cached_objects	= xfs_fs_nr_cached_objects,
>>  	.free_cached_objects	= xfs_fs_free_cached_objects,
>> +	.get_shared_files	= xfs_fs_get_shared_files,
>>  };
>>
>>  static int
>> diff --git a/include/linux/dax.h b/include/linux/dax.h
>> index 6904d4e0b2e0..0a85e321d6b4 100644
>> --- a/include/linux/dax.h
>> +++ b/include/linux/dax.h
>> @@ -40,6 +40,13 @@ struct dax_operations {
>>
>>  extern struct attribute_group dax_attribute_group;
>>
>> +struct shared_files {
>> +	struct list_head	list;
>> +	struct address_space	*mapping;
>> +	pgoff_t			index;
>> +	dax_entry_t		cookie;
>> +};
>> +
>>  #if IS_ENABLED(CONFIG_DAX)
>>  struct dax_device *dax_get_by_host(const char *host);
>>  struct dax_device *alloc_dax(void *private, const char *host,
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index f5abba86107d..81de3d2739b9 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1977,6 +1977,8 @@ struct super_operations {
>>  			struct shrink_control *);
>>  	long (*free_cached_objects)(struct super_block *,
>>  			struct shrink_control *);
>> +	int (*get_shared_files)(struct super_block *sb, pgoff_t offset,
>> +			struct list_head *list);
>>  };
>>
>>  /*
>> --
>> 2.27.0