Subject: Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block
To: "Ritesh Harjani (IBM)", Dave Chinner
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz, hch@infradead.org, djwong@kernel.org, willy@infradead.org, zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com, wangkefeng.wang@huawei.com
References: <87a5l8am4k.fsf@gmail.com>
From: Zhang Yi
Message-ID: <7fa1a8da-f335-b8b1-bfb6-fae88f20d598@huaweicloud.com>
Date: Mon, 6 May 2024 11:49:52 +0800
In-Reply-To: <87a5l8am4k.fsf@gmail.com>

On 2024/5/2 12:11, Ritesh Harjani (IBM) wrote:
> Dave Chinner writes:
>
>> On Wed, May 01, 2024 at 05:49:50PM +0530, Ritesh Harjani wrote:
>>> Dave Chinner writes:
>>>
>>>> On Wed, Apr 10, 2024 at 10:29:16PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi
>>>>>
>>>>> Now we look up the extent status entry without holding the i_data_sem
>>>>> before inserting a delalloc block. This works fine in the buffered write
>>>>> path because it holds i_rwsem and the folio lock, and the mmap path holds
>>>>> the folio lock, so the extent found locklessly
>>>>> couldn't be modified concurrently.
>>>>> But it could be raced by fallocate since it allocates blocks without
>>>>> holding i_rwsem and folio lock.
>>>>>
>>>>> ext4_page_mkwrite()             ext4_fallocate()
>>>>>  block_page_mkwrite()
>>>>>   ext4_da_map_blocks()
>>>>>    //find hole in extent status tree
>>>>>                                  ext4_alloc_file_blocks()
>>>>>                                   ext4_map_blocks()
>>>>>                                    //allocate block and unwritten extent
>>>>>    ext4_insert_delayed_block()
>>>>>     ext4_da_reserve_space()
>>>>>      //reserve one more block
>>>>>     ext4_es_insert_delayed_block()
>>>>>      //drop unwritten extent and add delayed extent by mistake
>>>>
>>>> Shouldn't this be serialised by the file invalidation lock? Hole
>>>> punching via fallocate must do this to avoid data use-after-free
>>>> bugs w.r.t. racing page faults and all the other fallocate ops need
>>>> to serialise page faults to avoid page cache level data corruption.
>>>> Yet here we see a problem resulting from a fallocate operation
>>>> racing with a page fault....
>>>
>>> IIUC, fallocate operations which invalidate the page cache contents need
>>> to take the invalidate_lock in exclusive mode to prevent page fault
>>> operations from loading pages for stale mappings (blocks which were
>>> marked free might get reused). This can cause stale data exposure.
>>>
>>> Here the fallocate operation requires allocation of unwritten extents and
>>> does not require truncation of the pagecache range. So I guess it is not
>>> strictly necessary to hold the invalidate lock here.
>>
>> True, but you can make exactly the same argument for write() vs
>> fallocate(). Yet this path in ext4_fallocate() locks out
>> concurrent write()s and waits for DIOs in flight to drain. What
>> makes buffered writes triggered by page faults special?
>>
>> i.e.
>> if you are going to say "we don't need serialisation between
>> writes and fallocate() allocating unwritten extents", then why is it
>> still explicitly serialising against both buffered and direct IO and
>> not just truncate and other fallocate() operations?
>>
>>> But I see XFS does take IOLOCK_EXCL and MMAPLOCK_EXCL even for this operation.
>>
>> Yes, that's the behaviour preallocation has had in XFS since we
>> introduced the MMAPLOCK almost a decade ago. This was long before
>> the file_invalidation_lock() was even a glimmer in Jan's eye.
>>
>> btrfs does the same thing, for the same reasons. COW support makes
>> extent tree manipulations excitingly complex at times...
>>
>>> I guess we could use the invalidate lock for fallocate operation in ext4
>>> too. However, I think we still require the current patch. The reason is
>>> ext4_da_map_blocks() call here first tries to look up the extent status
>>> cache w/o any i_data_sem lock in the fastpath. If it finds a hole, it
>>> takes the i_data_sem in write mode and just inserts an entry into extent
>>> status cache w/o re-checking for the same under the exclusive lock.
>>> ...So I believe we still should have this patch, which re-verifies under
>>> the write lock whether any other operation has already inserted an
>>> entry.
>>
>> Yup, I never said the code in the patch is wrong or unnecessary; I'm
>> commenting on the high level race condition that led to the bug
>> being triggered. i.e. that racing data modification operations with
>> low level extent manipulations is often dangerous and a potential
>> source of very subtle, hard to trigger, reproduce and debug issues
>> like the one reported...
>>
>
> Yes, thanks for explaining and commenting on the high level design.
> It was indeed helpful.
> And I agree with your comment that we can refactor
> out the common operations from fallocate path and use invalidate lock to
> protect against data modification (page fault) and extent manipulation
> path (fallocate operations).
>

Yeah, thanks for the explanation and suggestion, too. After looking at your
discussion, I also think we could refactor out a common helper and use the
file invalidation lock for the whole ext4 fallocate path; the current code is
too scattered.

Thanks,
Yi.
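
For readers following the thread, the re-check discussed above (look up without the exclusive lock, then verify again once it is held) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not ext4 code: `struct extent_map`, `fake_fallocate()`, `insert_delayed()` and the `racer` callback are hypothetical names invented for illustration, a `pthread_mutex_t` stands in for the exclusive `i_data_sem`, and the race is replayed deterministically in one thread rather than with real concurrency.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are ext4 APIs. */
enum blk_state { BLK_HOLE, BLK_DELAYED, BLK_UNWRITTEN };

#define NBLOCKS 8

struct extent_map {
    pthread_mutex_t lock;          /* stands in for the exclusive i_data_sem */
    enum blk_state state[NBLOCKS]; /* stands in for the extent status tree */
    int reserved;                  /* blocks reserved for delayed allocation */
};

void map_init(struct extent_map *m)
{
    pthread_mutex_init(&m->lock, NULL);
    for (size_t i = 0; i < NBLOCKS; i++)
        m->state[i] = BLK_HOLE;
    m->reserved = 0;
}

/* Models fallocate: marks a block as allocated-but-unwritten under the lock. */
void fake_fallocate(struct extent_map *m, size_t blk)
{
    pthread_mutex_lock(&m->lock);
    m->state[blk] = BLK_UNWRITTEN;
    pthread_mutex_unlock(&m->lock);
}

/*
 * Models the page-fault path. `racer`, if non-NULL, runs in the window
 * between the unlocked lookup and taking the lock, replaying the race in a
 * single thread. With real threads the unlocked read below would itself
 * need an atomic or a separate lock.
 * Returns 1 if a delayed block was inserted, 0 otherwise.
 */
int insert_delayed(struct extent_map *m, size_t blk, int recheck,
                   void (*racer)(struct extent_map *, size_t))
{
    int inserted = 0;

    /* Fast path: lookup without the exclusive lock. */
    if (m->state[blk] != BLK_HOLE)
        return 0;

    if (racer)
        racer(m, blk);

    pthread_mutex_lock(&m->lock);
    /* With recheck set, confirm the block is still a hole before inserting. */
    if (!recheck || m->state[blk] == BLK_HOLE) {
        m->reserved++;
        m->state[blk] = BLK_DELAYED;
        inserted = 1;
    }
    pthread_mutex_unlock(&m->lock);
    return inserted;
}
```

Calling `insert_delayed()` with `recheck` set to 0 and `fake_fallocate` as the racer reproduces the reported symptom in this model: the unwritten state is overwritten and one extra block is reserved. With `recheck` set to 1 the insert is skipped and the reservation count stays unchanged.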