From: Yang Shi
Date: Tue, 1 Mar 2022 19:39:04 -0800
Subject: [LSF/MM/BPF TOPIC] Potential silent data loss caused by hwpoisoned page cache
To: lsf-pc@lists.linux-foundation.org, Linux MM, Linux FS-devel Mailing List

I filled out the Google form but forgot to email the lists.

When discussing the patch that splits page cache THP in order to offline the poisoned page, Naoya mentioned there is a bigger problem [1] that prevents this from working, since the page cache page will be truncated if an uncorrectable error happens. Looking into this more deeply, it turns out that this approach (truncating the poisoned page) may incur silent data loss for all non-readonly filesystems if the page is dirty. It may be worse for in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are actually gone.

To solve this problem we could keep the poisoned dirty page in the page cache and then notify users on any later access, e.g. page fault, read/write, etc. Clean pages could be truncated as they are today, since they can be reread from disk later on. The consequence is that filesystems may find a poisoned page and manipulate it as a healthy page, since filesystems currently don't check whether a page is poisoned in any of the relevant paths except page fault. In general, we need to make filesystems aware of poisoned pages before we can keep them in the page cache to solve the data loss problem.

To make filesystems aware of poisoned pages we should consider:
- The page should not be written back: clearing the dirty flag could prevent writeback.
- The page should not be dropped (it shows as a clean page) by drop caches or other callers: the refcount pin from hwpoison could prevent invalidation (called by cache drop, inode cache shrinking, etc.), but it doesn't prevent invalidation in the DIO path.
- The page should still be able to be truncated/hole punched/unlinked: this works as it is.
- Users should be notified when the page is accessed, e.g. read/write, page fault and other paths (compression, encryption, etc.).

The scope of the last one is huge, since almost all filesystems need to check the hwpoison flag on every possible path once a page is returned from a page cache lookup.
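
To make the difference between the two policies concrete, here is a minimal userspace sketch. It is a toy model of a one-page file, not kernel code: every name in it (toy_file, toy_page, toy_poison_truncate, toy_poison_keep_dirty, ...) is hypothetical and exists only for illustration. It only tries to show why truncating a dirty poisoned page loses data silently, while keeping it (with the dirty flag cleared) lets a later access fail with -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical stand-in for a page cache page; not a kernel structure. */
struct toy_page {
	bool cached;	/* present in the toy page cache */
	bool dirty;	/* holds data not yet written to the backing store */
	bool hwpoison;	/* an uncorrectable memory error hit this page */
	char data[TOY_PAGE_SIZE];
};

/* A one-page "file": backing store plus at most one cached page. */
struct toy_file {
	char disk[TOY_PAGE_SIZE];
	struct toy_page page;
};

void toy_file_init(struct toy_file *f, const char *initial)
{
	assert(strlen(initial) < TOY_PAGE_SIZE);
	memset(f, 0, sizeof(*f));
	strncpy(f->disk, initial, TOY_PAGE_SIZE - 1);
}

/* Buffered write: only the cached copy changes until writeback runs. */
int toy_write(struct toy_file *f, const char *buf)
{
	assert(strlen(buf) < TOY_PAGE_SIZE);
	if (f->page.cached && f->page.hwpoison)
		return -EIO;
	memset(f->page.data, 0, TOY_PAGE_SIZE);
	strncpy(f->page.data, buf, TOY_PAGE_SIZE - 1);
	f->page.cached = true;
	f->page.dirty = true;
	return 0;
}

/* Policy A (truncate on error): drop the page even if it is dirty. */
void toy_poison_truncate(struct toy_file *f)
{
	memset(&f->page, 0, sizeof(f->page));
}

/* Policy B (the idea in this mail): keep a dirty page, drop a clean one. */
void toy_poison_keep_dirty(struct toy_file *f)
{
	if (!f->page.cached)
		return;
	if (f->page.dirty) {
		f->page.hwpoison = true;
		f->page.dirty = false;	/* keeps writeback away from bad data */
	} else {
		memset(&f->page, 0, sizeof(f->page));
	}
}

/* Read path: check the poison flag right after the cache lookup. */
int toy_read(struct toy_file *f, char *buf)
{
	if (f->page.cached && f->page.hwpoison)
		return -EIO;
	if (!f->page.cached) {
		/* Cache miss: refill from the backing store. */
		memcpy(f->page.data, f->disk, TOY_PAGE_SIZE);
		f->page.cached = true;
	}
	memcpy(buf, f->page.data, TOY_PAGE_SIZE);
	return 0;
}

/* Writeback only touches dirty pages, so a kept poisoned page is skipped. */
void toy_writeback(struct toy_file *f)
{
	if (f->page.cached && f->page.dirty) {
		memcpy(f->disk, f->page.data, TOY_PAGE_SIZE);
		f->page.dirty = false;
	}
}
```

In this model, writing "new" over a file containing "old" and then calling toy_poison_truncate() makes the next toy_read() succeed and return "old", which is the silent data loss. With toy_poison_keep_dirty() the same read returns -EIO, and toy_writeback() leaves the backing store untouched.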

This problem was briefly discussed before, but it seems no action was taken at that time. [2]

I have already converted shmem/tmpfs [3], since the problem seems more severe for in-memory filesystems. hugetlbfs is an in-memory filesystem as well, but per the discussion with the hugetlb folks it depends on double mapping support for hugetlbfs in order to set the hwpoison flag on subpages. Regular filesystems are definitely on the list as well.

I understand the problem may be very rare, but it is quite subtle, so even if data corruption is hit it is very hard to trace it back to this root cause. So I'd like to suggest this topic. It is related to both MM and FS, although the heavy lifting is on the FS side.

[1] https://lore.kernel.org/linux-mm/CAHbLzkqNPBh_sK09qfr4yu4WTFOzRy+MKj+PA7iG-adzi9zGsg@mail.gmail.com/T/#m0e959283380156f1d064456af01ae51fdff91265
[2] https://lore.kernel.org/lkml/20210318183350.GT3420@casper.infradead.org/
[3] https://lore.kernel.org/linux-mm/20211020210755.23964-1-shy828301@gmail.com/