From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.5 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED, USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 58CDCC433DB for ; Fri, 8 Jan 2021 09:52:31 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id A7570235FF for ; Fri, 8 Jan 2021 09:52:30 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A7570235FF Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=cn.fujitsu.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id AA7D88D0173; Fri, 8 Jan 2021 04:52:29 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A56A18D0156; Fri, 8 Jan 2021 04:52:29 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8F9148D0173; Fri, 8 Jan 2021 04:52:29 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0117.hostedemail.com [216.40.44.117]) by kanga.kvack.org (Postfix) with ESMTP id 7903F8D0156 for ; Fri, 8 Jan 2021 04:52:29 -0500 (EST) Received: from smtpin21.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 3E81110F5F for ; Fri, 8 Jan 2021 09:52:29 +0000 (UTC) X-FDA: 77682142818.21.bird87_4511983274f1 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin21.hostedemail.com (Postfix) with ESMTP id 2251318059D7B for ; Fri, 8 Jan 2021 09:52:29 +0000 (UTC) X-HE-Tag: bird87_4511983274f1 X-Filterd-Recvd-Size: 11855 Received: from heian.cn.fujitsu.com (mail.cn.fujitsu.com [183.91.158.132]) by imf35.hostedemail.com (Postfix) with ESMTP for ; Fri, 8 Jan 2021 09:52:27 +0000 (UTC) X-IronPort-AV: E=Sophos;i="5.79,330,1602518400"; d="scan'208";a="103307194" Received: from unknown (HELO cn.fujitsu.com) ([10.167.33.5]) by heian.cn.fujitsu.com with ESMTP; 08 Jan 2021 17:52:15 +0800 Received: from G08CNEXMBPEKD05.g08.fujitsu.local (unknown [10.167.33.204]) by cn.fujitsu.com (Postfix) with ESMTP id 87C184CE602C; Fri, 8 Jan 2021 17:52:13 +0800 (CST) Received: from irides.mr (10.167.225.141) by G08CNEXMBPEKD05.g08.fujitsu.local (10.167.33.204) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Fri, 8 Jan 2021 17:52:12 +0800 Subject: Re: [RFC PATCH v3 8/9] md: Implement ->corrupted_range() To: "Darrick J. Wong" CC: , , , , , , , , , , , , , Theodore Ts'o References: <20201215121414.253660-1-ruansy.fnst@cn.fujitsu.com> <20201215121414.253660-9-ruansy.fnst@cn.fujitsu.com> <20201215205102.GB6918@magnolia> <20210104233423.GR6918@magnolia> From: Ruan Shiyang Message-ID: <77ecf385-0edc-6576-8963-867adbb9405b@cn.fujitsu.com> Date: Fri, 8 Jan 2021 17:52:11 +0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.6.0 MIME-Version: 1.0 In-Reply-To: <20210104233423.GR6918@magnolia> Content-Type: text/plain; charset="utf-8"; format=flowed Content-Language: en-US X-Originating-IP: [10.167.225.141] X-ClientProxiedBy: G08CNEXCHPEKD04.g08.fujitsu.local (10.167.33.200) To G08CNEXMBPEKD05.g08.fujitsu.local (10.167.33.204) X-yoursite-MailScanner-ID: 87C184CE602C.AE9C8 X-yoursite-MailScanner: Found to be clean X-yoursite-MailScanner-From: ruansy.fnst@cn.fujitsu.com Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2021/1/5 =E4=B8=8A=E5=8D=887:34, Darrick J. Wong wrote: > On Fri, Dec 18, 2020 at 10:11:54AM +0800, Ruan Shiyang wrote: >> >> >> On 2020/12/16 =E4=B8=8A=E5=8D=884:51, Darrick J. Wong wrote: >>> On Tue, Dec 15, 2020 at 08:14:13PM +0800, Shiyang Ruan wrote: >>>> With the support of ->rmap(), it is possible to obtain the superbloc= k on >>>> a mapped device. >>>> >>>> If a pmem device is used as one target of mapped device, we cannot >>>> obtain its superblock directly. With the help of SYSFS, the mapped >>>> device can be found on the target devices. So, we iterate the >>>> bdev->bd_holder_disks to obtain its mapped device. >>>> >>>> Signed-off-by: Shiyang Ruan >>>> --- >>>> drivers/md/dm.c | 66 ++++++++++++++++++++++++++++++++++++++= +++++ >>>> drivers/nvdimm/pmem.c | 9 ++++-- >>>> fs/block_dev.c | 21 ++++++++++++++ >>>> include/linux/genhd.h | 7 +++++ >>>> 4 files changed, 100 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >>>> index 4e0cbfe3f14d..9da1f9322735 100644 >>>> --- a/drivers/md/dm.c >>>> +++ b/drivers/md/dm.c >>>> @@ -507,6 +507,71 @@ static int dm_blk_report_zones(struct gendisk *= disk, sector_t sector, >>>> #define dm_blk_report_zones NULL >>>> #endif /* CONFIG_BLK_DEV_ZONED */ >>>> +struct dm_blk_corrupt { >>>> + struct block_device *bdev; >>>> + sector_t offset; >>>> +}; >>>> + >>>> +static int dm_blk_corrupt_fn(struct dm_target *ti, struct dm_dev *d= ev, >>>> + sector_t start, sector_t len, void *data) >>>> +{ >>>> + struct dm_blk_corrupt *bc =3D data; >>>> + >>>> + return bc->bdev =3D=3D (void *)dev->bdev && >>>> + (start <=3D bc->offset && bc->offset < start + len); >>>> +} >>>> + >>>> +static int dm_blk_corrupted_range(struct gendisk *disk, >>>> + struct block_device *target_bdev, >>>> + loff_t target_offset, size_t len, void *data) >>>> +{ >>>> + struct mapped_device *md =3D disk->private_data; >>>> + struct block_device *md_bdev =3D md->bdev; >>>> + struct dm_table *map; >>>> + struct dm_target *ti; >>>> + struct super_block *sb; >>>> + int srcu_idx, i, rc =3D 0; >>>> + bool found =3D false; >>>> + sector_t disk_sec, target_sec =3D to_sector(target_offset); >>>> + >>>> + map =3D dm_get_live_table(md, &srcu_idx); >>>> + if (!map) >>>> + return -ENODEV; >>>> + >>>> + for (i =3D 0; i < dm_table_get_num_targets(map); i++) { >>>> + ti =3D dm_table_get_target(map, i); >>>> + if (ti->type->iterate_devices && ti->type->rmap) { >>>> + struct dm_blk_corrupt bc =3D {target_bdev, target_sec}; >>>> + >>>> + found =3D ti->type->iterate_devices(ti, dm_blk_corrupt_fn, &bc); >>>> + if (!found) >>>> + continue; >>>> + disk_sec =3D ti->type->rmap(ti, target_sec); >>> >>> What happens if the dm device has multiple reverse mappings because t= he >>> physical storage is being shared at multiple LBAs? (e.g. a >>> deduplication target) >> >> I thought that the dm device knows the mapping relationship, and it ca= n be >> done by implementation of ->rmap() in each target. Did I understand i= t >> wrong? >=20 > The dm device /does/ know the mapping relationship. I'm asking what > happens if there are *multiple* mappings. For example, a deduplicating > dm device could observe that the upper level code wrote some data to > sector 200 and now it wants to write the same data to sector 500. > Instead of writing twice, it simply maps sector 500 in its LBA space to > the same space that it mapped sector 200. >=20 > Pretend that sector 200 on the dm-dedupe device maps to sector 64 on th= e > underlying storage (call it /dev/pmem1 and let's say it's the only > target sitting underneath the dm-dedupe device). >=20 > If /dev/pmem1 then notices that sector 64 has gone bad, it will start > calling ->corrupted_range handlers until it calls dm_blk_corrupted_rang= e > on the dm-dedupe device. At least in theory, the dm-dedupe driver's > rmap method ought to return both (64 -> 200) and (64 -> 500) so that > dm_blk_corrupted_range can pass on both corruption notices to whatever'= s > sitting atop the dedupe device. >=20 > At the moment, your ->rmap prototype is only capable of returning one > sector_t mapping per target, and there's only the one target under the > dedupe device, so we cannot report the loss of sectors 200 and 500 to > whatever device is sitting on top of dm-dedupe. Got it. I didn't know there is a kind of dm device called dm-dedupe.=20 Thanks for the guidance. -- Thanks, Ruan Shiyang. >=20 > --D >=20 >>> >>>> + break; >>>> + } >>>> + } >>>> + >>>> + if (!found) { >>>> + rc =3D -ENODEV; >>>> + goto out; >>>> + } >>>> + >>>> + sb =3D get_super(md_bdev); >>>> + if (!sb) { >>>> + rc =3D bd_disk_holder_corrupted_range(md_bdev, to_bytes(disk_sec)= , len, data); >>>> + goto out; >>>> + } else if (sb->s_op->corrupted_range) { >>>> + loff_t off =3D to_bytes(disk_sec - get_start_sect(md_bdev)); >>>> + >>>> + rc =3D sb->s_op->corrupted_range(sb, md_bdev, off, len, data); >>> >>> This "call bd_disk_holder_corrupted_range or sb->s_op->corrupted_rang= e" >>> logic appears twice; should it be refactored into a common helper? >>> >>> Or, should the superblock dispatch part move to >>> bd_disk_holder_corrupted_range? >> >> bd_disk_holder_corrupted_range() requires SYSFS configuration. I intr= oduce >> it to handle those block devices that can not obtain superblock by >> `get_super()`. >> >> Usually, if we create filesystem directly on a pmem device, or make so= me >> partitions at first, we can use `get_super()` to get the superblock. = In >> other case, such as creating a LVM on pmem device, `get_super()` does = not >> work. >> >> So, I think refactoring it into a common helper looks better. >> >> >> -- >> Thanks, >> Ruan Shiyang. >> >>> >>>> + } >>>> + drop_super(sb); >>>> + >>>> +out: >>>> + dm_put_live_table(md, srcu_idx); >>>> + return rc; >>>> +} >>>> + >>>> static int dm_prepare_ioctl(struct mapped_device *md, int *srcu_i= dx, >>>> struct block_device **bdev) >>>> { >>>> @@ -3084,6 +3149,7 @@ static const struct block_device_operations dm= _blk_dops =3D { >>>> .getgeo =3D dm_blk_getgeo, >>>> .report_zones =3D dm_blk_report_zones, >>>> .pr_ops =3D &dm_pr_ops, >>>> + .corrupted_range =3D dm_blk_corrupted_range, >>>> .owner =3D THIS_MODULE >>>> }; >>>> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c >>>> index 4688bff19c20..e8cfaf860149 100644 >>>> --- a/drivers/nvdimm/pmem.c >>>> +++ b/drivers/nvdimm/pmem.c >>>> @@ -267,11 +267,14 @@ static int pmem_corrupted_range(struct gendisk= *disk, struct block_device *bdev, >>>> bdev_offset =3D (disk_sector - get_start_sect(bdev)) << SECTOR_S= HIFT; >>>> sb =3D get_super(bdev); >>>> - if (sb && sb->s_op->corrupted_range) { >>>> + if (!sb) { >>>> + rc =3D bd_disk_holder_corrupted_range(bdev, bdev_offset, len, dat= a); >>>> + goto out; >>>> + } else if (sb->s_op->corrupted_range) >>>> rc =3D sb->s_op->corrupted_range(sb, bdev, bdev_offset, len, da= ta); >>>> - drop_super(sb); >>> >>> This is out of scope for this patch(set) but do you think that the sc= si >>> disk driver should intercept media errors from sense data and call >>> ->corrupted_range too? ISTR Ted muttering that one of his employers = had >>> a patchset to do more with sense data than the upstream kernel curren= tly >>> does... >>> >>>> - } >>>> + drop_super(sb); >>>> +out: >>>> bdput(bdev); >>>> return rc; >>>> } >>>> diff --git a/fs/block_dev.c b/fs/block_dev.c >>>> index 9e84b1928b94..d3e6bddb8041 100644 >>>> --- a/fs/block_dev.c >>>> +++ b/fs/block_dev.c >>>> @@ -1171,6 +1171,27 @@ struct bd_holder_disk { >>>> int refcnt; >>>> }; >>>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_= t off, size_t len, void *data) >>>> +{ >>>> + struct bd_holder_disk *holder; >>>> + struct gendisk *disk; >>>> + int rc =3D 0; >>>> + >>>> + if (list_empty(&(bdev->bd_holder_disks))) >>>> + return -ENODEV; >>>> + >>>> + list_for_each_entry(holder, &bdev->bd_holder_disks, list) { >>>> + disk =3D holder->disk; >>>> + if (disk->fops->corrupted_range) { >>>> + rc =3D disk->fops->corrupted_range(disk, bdev, off, len, data); >>>> + if (rc !=3D -ENODEV) >>>> + break; >>>> + } >>>> + } >>>> + return rc; >>>> +} >>>> +EXPORT_SYMBOL_GPL(bd_disk_holder_corrupted_range); >>>> + >>>> static struct bd_holder_disk *bd_find_holder_disk(struct block_de= vice *bdev, >>>> struct gendisk *disk) >>>> { >>>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h >>>> index ed06209008b8..fba247b852fa 100644 >>>> --- a/include/linux/genhd.h >>>> +++ b/include/linux/genhd.h >>>> @@ -382,9 +382,16 @@ int blkdev_ioctl(struct block_device *, fmode_t= , unsigned, unsigned long); >>>> long compat_blkdev_ioctl(struct file *, unsigned, unsigned long); >>>> #ifdef CONFIG_SYSFS >>>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_= t off, >>>> + size_t len, void *data); >>>> int bd_link_disk_holder(struct block_device *bdev, struct gendisk= *disk); >>>> void bd_unlink_disk_holder(struct block_device *bdev, struct gend= isk *disk); >>>> #else >>>> +int bd_disk_holder_corrupted_range(struct block_device *bdev, loff_= t off, >>>> + size_t len, void *data) >>>> +{ >>>> + return 0; >>>> +} >>>> static inline int bd_link_disk_holder(struct block_device *bdev, >>>> struct gendisk *disk) >>>> { >>>> --=20 >>>> 2.29.2 >>>> >>>> >>>> >>> >>> >> >> >=20 >=20