From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7459EC433E0 for ; Mon, 4 Jan 2021 21:52:52 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D24A82245C for ; Mon, 4 Jan 2021 21:52:51 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D24A82245C Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id CA6EC8D002F; Mon, 4 Jan 2021 16:52:50 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C573C8D002D; Mon, 4 Jan 2021 16:52:50 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B45278D002F; Mon, 4 Jan 2021 16:52:50 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 9AA928D002D for ; Mon, 4 Jan 2021 16:52:50 -0500 (EST) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 6891C180AD80F for ; Mon, 4 Jan 2021 21:52:50 +0000 (UTC) X-FDA: 77669442900.06.crack71_4a05c41274d3 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin06.hostedemail.com (Postfix) with ESMTP id 41E9210048EC8 for ; Mon, 4 Jan 2021 21:52:50 +0000 (UTC) X-HE-Tag: crack71_4a05c41274d3 X-Filterd-Recvd-Size: 9023 Received: from mail-lf1-f49.google.com (mail-lf1-f49.google.com [209.85.167.49]) by imf42.hostedemail.com (Postfix) with ESMTP for ; Mon, 4 Jan 2021 21:52:49 +0000 (UTC) Received: by mail-lf1-f49.google.com with SMTP id x20so67931016lfe.12 for ; Mon, 04 Jan 2021 13:52:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux-foundation.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=1WyRnE3FCyMn5lEtFjWaRD8K2XDmf+b9oA3QV7CY8n0=; b=SfyxRJvKNM1X28EwvAfTS7BZlF0oMAXEYwfZrBB9ih/I0v1qUWx2vTSSBmIPoIk3XN NerrTiIAdGQ0ZfU8R/ib1aYHbkJUfxS9U3NvMz8uQQLInIzJlc+BTUBFWQucO5kHsibS xgSpZ7zn5JJ1uoiJ9I0E3t4m6ycE+5i5MX4nw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=1WyRnE3FCyMn5lEtFjWaRD8K2XDmf+b9oA3QV7CY8n0=; b=mfVwqooZgsZvY799f6ncmZHkmweIJWZ8bY4kdDScATsdH6PF+diIwPv5F04a/0J9fh hVPSr3vHC96Ht3Rty6MPkMfxsE3Tx5RL7jxAhfFqdo1OM3xdFgzUP2EQgsQwXH/ccf9E 5w3dX/SETi76mNjLQPDIqewyV3EtipwzHloty1d88EPL59t/+CkqwISNfxsjBMQYm2QZ J57xehiId4caKEowtli3dsb6dm9HXWVLwAPVDAKM+Z0AVCddFP5wcEgB2r56B9lAi/4B 4iVprNg+O+09YG1m0VvuSemd5BXGeLv1HQwNQ5ybCxALymGasF4gGKlGFN2L2Jyatj2y 5jTg== X-Gm-Message-State: AOAM532wQMOozm/eAljbnSsemnuLNdyEMR9Z0pGk/m5rvmq0hJ466r2e 56dUdaBRu05hEXvl2cn8yNv1aRHI2W50Cw== X-Google-Smtp-Source: ABdhPJzY36VCBAEZMP9tBBOpBl1ZmvpsBZQuGr/NE1ebpoMaBRCzr0vBscHa/AGjpYqsUyPwOjJCEw== X-Received: by 2002:a2e:b558:: with SMTP id a24mr34800434ljn.328.1609797167327; Mon, 04 Jan 2021 13:52:47 -0800 (PST) Received: from mail-lf1-f41.google.com (mail-lf1-f41.google.com. [209.85.167.41]) by smtp.gmail.com with ESMTPSA id z26sm7422132lfi.54.2021.01.04.13.52.46 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 04 Jan 2021 13:52:46 -0800 (PST) Received: by mail-lf1-f41.google.com with SMTP id h205so67953864lfd.5 for ; Mon, 04 Jan 2021 13:52:46 -0800 (PST) X-Received: by 2002:a2e:b4af:: with SMTP id q15mr35051887ljm.507.1609797165713; Mon, 04 Jan 2021 13:52:45 -0800 (PST) MIME-Version: 1.0 References: <000000000000886dbd05b7ffa8db@google.com> <20210104124153.0992b1f7fd1a145e193a333f@linux-foundation.org> In-Reply-To: <20210104124153.0992b1f7fd1a145e193a333f@linux-foundation.org> From: Linus Torvalds Date: Mon, 4 Jan 2021 13:52:29 -0800 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: kernel BUG at mm/page-writeback.c:LINE! To: Andrew Morton , Hugh Dickins , Matthew Wilcox Cc: syzbot , Linux Kernel Mailing List , Linux-MM , syzkaller-bugs Content-Type: text/plain; charset="UTF-8" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Jan 4, 2021 at 12:41 PM Andrew Morton wrote: > > > > > kernel BUG at mm/page-writeback.c:2241! > > Call Trace: > > mpage_writepages+0xd8/0x230 fs/mpage.c:714 > > do_writepages+0xec/0x290 mm/page-writeback.c:2352 > > __filemap_fdatawrite_range+0x2a1/0x380 mm/filemap.c:422 > > fat_cont_expand+0x169/0x230 fs/fat/file.c:235 > > fat_setattr+0xac2/0xf40 fs/fat/file.c:501 > > notify_change+0xb60/0x10a0 fs/attr.c:336 > > do_truncate+0x134/0x1f0 fs/open.c:64 > > do_sys_ftruncate+0x703/0x860 fs/open.c:195 > > do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Well that's exciting. write_cache_pages() does: > > if (PageWriteback(page)) { > if (wbc->sync_mode != WB_SYNC_NONE) > wait_on_page_writeback(page); > else > goto continue_unlock; > } > > bang -->> BUG_ON(PageWriteback(page)); This is the same situation that was discussed earlier, and that we thought Hugh commit 073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)") should fix. Basically, the theory was that the sequence if (PageWriteback(page)) wait_on_page_writeback(page); BUG_ON(PageWriteback(page)); could have the wake-up of the _previous_ owner of the page happen, and wake up the wait_on_page_writeback() waiter, but then a new owner of the page would re-allocate it and mark it for writeback. > So either wait_on_page_writeback() simply failed to work (returned > without waiting) or someone came in and unexpectedly set PG_writeback a > second time. So Hugh found at least _one_ way that that "someone came in and unexpectedly set PG_writeback a second time" could happen. But that fix by Hugh is already in that HEAD commit that syzbot is now reporting, so there's something else going on. > Linus, how confident are you in those wait_on_page_bit_common() > changes? Pretty confident. The atomicity of the bitops themselves is fairly simple. But in the writeback bit? No. The old code would basically _loop_ if it was woken up and the writeback bit was set again, and would hide any problems with it. The new code basically goes "ok, the writeback bit was clear at one point, so I've waited enough". We could easily turn the "if ()" in wait_on_page_writeback() into a "while()". But honestly, it does smell to me like the bug is always in the caller not having serialized with whatever actually starts writeback. High figured out one such case. This code holds the page lock, but I don't see where set_page_writeback() would always be done with the page lock held. So what really protects against PG_writeback simply being set again? The whole BUG_ON() seems entirely buggy to me. In fact, even if you hold the page lock while doing set_page_writeback(), since the actual IO does *NOT* hold the page lock, the unlock happens without it. So even if every single case of setting the page writeback were to hold the page lock, what keeps this from happening: CPU1 = end previous writeback CPU2 = start new writeback under page lock CPU3 = write_cache_pages() CPU1 CPU2 CPU3 ---- ---- ---- end_page_writeback() test_clear_page_writeback(page) ... delayed... lock_page(); set_page_writeback() unlock_page() lock_page() wait_on_page_writeback(); wake_up_page(page, PG_writeback); .. wakes up CPU3 .. BUG_ON(PageWriteback(page)); IOW, that BUG_ON() really feels entirely bogus to me. Notice how it wasn't actually serialized with the waking up of the _previous_ bit. Could we make the wait_on_page_writeback() just loop if it sees the page under writeback again? Sure. Could we make the wait_on_page_bit_common() code say "if this is PG_writeback, I won't wake it up after all, because the bit is set again?" Sure. But I feel it's really that end_page_writeback() itself is fundamentally buggy, because the "wakeup" is not atomic with the bit clearing _and_ it doesn't actually hold the page lock that is allegedly serializing this all. That raciness was what caused the "stale wakeup from previous owner" thing too. And I think that Hugh fixed the page re-use case, but the fundamental problem of end_page_writeback() kind of remained. And yes, I think this was all hidden by wait_on_page_writeback() effectively looping over the "PageWriteback(page)" test because of how wait_on_page_bit() worked. So the one-liner of changing the "if" to "while" in wait_on_page_writeback() should get us back to what we used to do. Except I still get the feeling that the bug really is not in wait_on_page_writeback(), but in the end_page_writeback() side. Comments? I'm perfectly happy doing the one-liner. I would just be _happier_ with end_page_writeback() having the serialization.. The real problem is that "wake_up_page(page, bit)" is not the thing that actually clears the bit. So there's a fundamental race between clearing the bit and waking something up. Which makes me think that the best option would actually be to move the bit clearing _into_ wake_up_page(). But that looks like a very big change. Linus