From: Baokun Li <libaokun1@huawei.com>
Date: Fri, 28 Apr 2023 11:47:26 +0800
Subject: Re: [ext4 io hang] buffered write io hang in balance_dirty_pages
To: Ming Lei
Cc: Matthew Wilcox, Theodore Ts'o, Andreas Dilger, Andrew Morton,
 Dave Chinner, Eric Sandeen, Christoph Hellwig, Zhang Yi, yangerkun,
 Baokun Li
List-ID: linux-mm@kvack.org
On 2023/4/28 9:41, Ming Lei wrote:
> On Thu, Apr 27, 2023 at 07:27:04PM +0800, Ming Lei wrote:
>> On Thu, Apr 27, 2023 at 07:19:35PM +0800, Baokun Li wrote:
>>> On 2023/4/27 18:01, Ming Lei wrote:
>>>> On Thu, Apr 27, 2023 at 02:36:51PM +0800, Baokun Li wrote:
>>>>> On 2023/4/27 12:50, Ming Lei wrote:
>>>>>> Hello Matthew,
>>>>>>
>>>>>> On Thu, Apr 27, 2023 at 04:58:36AM +0100, Matthew Wilcox wrote:
>>>>>>> On Thu, Apr 27, 2023 at 10:20:28AM +0800, Ming Lei wrote:
>>>>>>>> Hello Guys,
>>>>>>>>
>>>>>>>> I got one report in which buffered write IO
>>>>>>>> hangs in balance_dirty_pages,
>>>>>>>> after one nvme block device is unplugged physically, then umount can't
>>>>>>>> succeed.
>>>>>>>
>>>>>>> That's a feature, not a bug ... the dd should continue indefinitely?
>>>>>>
>>>>>> Can you explain what the feature is? I don't see such an 'issue' or
>>>>>> 'feature' on xfs.
>>>>>>
>>>>>> The device has been gone, so IMO it is reasonable for FS buffered
>>>>>> write IO to fail. Actually dmesg has shown 'EXT4-fs (nvme0n1):
>>>>>> Remounting filesystem read-only'. These things may confuse the user.
>>>>>
>>>>> The reason for this difference is that ext4 and xfs handle errors
>>>>> differently.
>>>>>
>>>>> ext4 remounts the filesystem read-only, or even just continues;
>>>>> vfs_write does not check for either of these.
>>>>
>>>> vfs_write may not find anything wrong, but the ext4 remount could see
>>>> that the disk is gone, which might happen during or after the remount,
>>>> however.
>>>>
>>>>> xfs shuts down the filesystem, so it returns a failure from
>>>>> xfs_file_write_iter when it finds an error.
>>>>>
>>>>> ``` ext4
>>>>> ksys_write
>>>>>  vfs_write
>>>>>   ext4_file_write_iter
>>>>>    ext4_buffered_write_iter
>>>>>     ext4_write_checks
>>>>>      file_modified
>>>>>       file_modified_flags
>>>>>        __file_update_time
>>>>>         inode_update_time
>>>>>          generic_update_time
>>>>>           __mark_inode_dirty
>>>>>            ext4_dirty_inode ---> 2. void func, no error propagated out
>>>>>             __ext4_journal_start_sb
>>>>>              ext4_journal_check_start ---> 1. Error found, remount-ro
>>>>>     generic_perform_write ---> 3. No error sensed, continue
>>>>>      balance_dirty_pages_ratelimited
>>>>>       balance_dirty_pages_ratelimited_flags
>>>>>        balance_dirty_pages
>>>>>         // 4.
>>>>>         Sleeping, waiting for dirty pages to be freed
>>>>>         __set_current_state(TASK_KILLABLE)
>>>>>         io_schedule_timeout(pause);
>>>>> ```
>>>>>
>>>>> ``` xfs
>>>>> ksys_write
>>>>>  vfs_write
>>>>>   xfs_file_write_iter
>>>>>    if (xfs_is_shutdown(ip->i_mount))
>>>>>      return -EIO;    ---> dd fails
>>>>> ```
>>>>
>>>> Thanks for the info, which is really helpful for me to understand the
>>>> problem.
>>>>
>>>>>>> balance_dirty_pages() is sleeping in KILLABLE state, so kill -9 of
>>>>>>> the dd process should succeed.
>>>>>>
>>>>>> Yeah, dd can be killed, however it may be any application(s), :-)
>>>>>>
>>>>>> Fortunately it won't cause trouble during reboot/power off, given
>>>>>> userspace will be killed at that time.
>>>>>>
>>>>>> Thanks,
>>>>>> Ming
>>>>>
>>>>> Don't worry about that, we always set the current thread to
>>>>> TASK_KILLABLE while waiting in balance_dirty_pages().
>>>>
>>>> I have another concern: if 'dd' isn't killed, the dirty pages won't be
>>>> cleaned, and that (large amount of) memory becomes unusable; a typical
>>>> scenario could be an unplugged USB HDD.
>>>>
>>>> thanks,
>>>> Ming
>>>
>>> Yes, it is unreasonable to continue writing data through a previously
>>> opened fd after the file system becomes read-only, resulting in dirty
>>> page accumulation.
>>>
>>> I provided a patch in another reply.
>>> Could you help test whether it solves your problem?
>>> If it does, I will officially send it to the mailing list.
>>
>> OK, I will test it tomorrow.
>
> Your patch can avoid the dd hang when bs is the default 512, but if bs is
> increased to 1G and more 'dd' tasks are started, the dd hang can still be
> observed.

Thank you for your testing!

Yes, my patch only prevents the adding of new dirty pages; it doesn't
clear the dirty pages that already exist. The reason it stops working as
bs grows is that there are already enough dirty pages to trigger
balance_dirty_pages().
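For reference, the xfs-style early bail-out that this thread contrasts
with could in principle be mirrored in ext4's buffered write path. The
following is only a sketch of that idea, not the patch actually posted
in this thread; it reuses mainline names (ext4_forced_shutdown(),
EXT4_SB(), sb_rdonly()), but the exact placement of the checks is an
assumption:

```
/*
 * Sketch only (kernel code, not standalone-buildable): refuse to add
 * new dirty pages once the filesystem has hit a fatal error, instead
 * of letting generic_perform_write() continue and later stall in
 * balance_dirty_pages().
 */
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
					struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);

	/* Analogous to xfs_is_shutdown() in xfs_file_write_iter() */
	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
		return -EIO;

	/* Fail writes through fds opened before an errors=remount-ro
	 * transition, so they stop dirtying pages. */
	if (unlikely(sb_rdonly(inode->i_sb)))
		return -EROFS;

	/* ... original inode locking and generic_perform_write() path ... */
}
```

As the test results below show, a check like this only stops new dirty
pages from accumulating; it does nothing about pages that were dirtied
before the device disappeared.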
Executing drop_caches at this point may make dd fail and exit, but the
dirty pages are still not cleared, and ext4 does not implement a
shutdown either. These dirty pages will not be cleared until the
filesystem is unmounted.

This is the result of my test at bs=512:

ext4 -- remount read-only:
  OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
313872  313872  100%    0.10K   8048        39      32192K  buffer_head  ---> wait to max
 82602   82465   99%    0.10K   2118        39       8472K  buffer_head  ---> kill dd && drop_caches
   897     741   82%    0.10K     23        39         92K  buffer_head  ---> umount

patched:
 25233   25051   99%    0.10K    647        39       2588K  buffer_head

> The reason should be the next paragraph I posted.
>
> Another thing is that if remount read-only makes sense on one dead
> disk? Yeah, block layer doesn't export such interface for querying
> if bdev is dead. However, I think it is reasonable to export such
> interface if FS needs that.

Ext4 just detects the I/O error and remounts the filesystem read-only;
it doesn't know whether the disk is dead or not.

I asked Yu Kuai, and he said that disk_live() can be used to determine
whether a disk has been removed, based on the status of the inode
corresponding to the block device, but this is generally not done in
file systems.

>> But I am afraid if it can avoid the issue completely because the
>> old write task hang in balance_dirty_pages() may still write/dirty
>> pages if it is one very big size write IO.
>
> thanks,
> Ming

Those dirty pages that are already there keep piling up and cannot be
written back, which I think is the real problem. Could the block layer
clear those dirty pages when it detects that the disk has been deleted?

-- 
With Best Regards,
Baokun Li