From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5F67EE77179 for ; Fri, 6 Dec 2024 10:45:20 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CE4B76B020F; Fri, 6 Dec 2024 05:45:19 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C6DFD6B0213; Fri, 6 Dec 2024 05:45:19 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B0E406B0246; Fri, 6 Dec 2024 05:45:19 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 902176B020F for ; Fri, 6 Dec 2024 05:45:19 -0500 (EST) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 450FD81A89 for ; Fri, 6 Dec 2024 10:45:19 +0000 (UTC) X-FDA: 82864201350.09.0B6B74B Received: from szxga08-in.huawei.com (szxga08-in.huawei.com [45.249.212.255]) by imf24.hostedemail.com (Postfix) with ESMTP id 99D98180016 for ; Fri, 6 Dec 2024 10:45:14 +0000 (UTC) Authentication-Results: imf24.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf24.hostedemail.com: domain of wangkefeng.wang@huawei.com designates 45.249.212.255 as permitted sender) smtp.mailfrom=wangkefeng.wang@huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1733481909; a=rsa-sha256; cv=none; b=3p2HwaqP/4dHOlSJDEsEuAjcZ537TbPTPZc8sknCadNX4IBiWsuey+ecxGHa9ZOSlGpSuZ Q0widLKaNG1vkaTokZjPZaHzm1n1oISKQ4QpNcp4qw54Kg8XbO6YBRJ3ANggF93Jn+1yCz RecVT6EmE3eAJ+5rMvhb1dgA1VUg44Y= ARC-Authentication-Results: i=1; imf24.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf24.hostedemail.com: domain of wangkefeng.wang@huawei.com designates 45.249.212.255 as permitted sender) smtp.mailfrom=wangkefeng.wang@huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1733481909; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8U3WYDuQKgCumAJd/1gGA9j0bWLLbT2H1bGrr2YdOlk=; b=7BKd47P0/2baLczy07MZR60fC45jmi4Xom9xM1u6An+yPICwhP415Jm/mFHb5GEJbbH4Z8 j5eHydkBRC46MwsjPV5T1Xv6ONFVog63+j+VGUG4BBzOxl5/wilDf4lJ3kf9zUADeYO2ds 8flY6xHrrTBz787l6v2+RKKmMpm9rUg= Received: from mail.maildlp.com (unknown [172.19.88.105]) by szxga08-in.huawei.com (SkyGuard) with ESMTP id 4Y4SW94n7bz1V5jT; Fri, 6 Dec 2024 18:42:13 +0800 (CST) Received: from dggpemf100008.china.huawei.com (unknown [7.185.36.138]) by mail.maildlp.com (Postfix) with ESMTPS id C19EE1402E0; Fri, 6 Dec 2024 18:45:12 +0800 (CST) Received: from [10.174.177.243] (10.174.177.243) by dggpemf100008.china.huawei.com (7.185.36.138) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Fri, 6 Dec 2024 18:45:12 +0800 Message-ID: Date: Fri, 6 Dec 2024 18:45:11 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH -next] ovl: respect underlying filesystem's get_unmapped_area() To: Lorenzo Stoakes , Jinjiang Tu CC: , , , , , , , , , Matthew Wilcox References: <20241205143038.3260233-1-tujinjiang@huawei.com> <69b72e3d-b101-4641-9ce5-51346c93a98d@lucifer.local> <041dcc1a-0630-27b9-661b-8c64a3775426@huawei.com> Content-Language: en-US From: Kefeng Wang In-Reply-To: Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.174.177.243] X-ClientProxiedBy: dggems701-chm.china.huawei.com (10.3.19.178) To dggpemf100008.china.huawei.com (7.185.36.138) X-Rspam-User: X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 99D98180016 X-Stat-Signature: qws9etgn51pwbjydeuw84nh8fewcjbax X-HE-Tag: 1733481914-163217 X-HE-Meta: U2FsdGVkX18Kv+dZ4SvjyCxyPUAHkwcGhBFksrjvJwxrj3x0XkRPQVolP10L1kVWChVpnPC0HV3EL07WEZGsyCN7nVA2p1TvIB9rYQvPlO6VCUWwL0bUk0n9en7j09skiy9tpZKSerns8n9RqoSU4xJGVvOxQ9r6o3sytN0+54+byNKnFkLuPrX6uIbB0Cc6UKrMfV2SkC2ZqyAAV/DCEB+0lm5wJECWVuUcTwuIlrlFlAYFA1kMRnQKU1b7+U7WUoPqukEOZI8fGy8XggZX7GsBUNTXlL+HuZkF1lHTcZjzpce7H6fU7aeD9gmcDnemgzaVGrTVNQIyk+YRhFz4CdHAbf0tv1YB9jxZPaTknTFVMvLjY65qmstH1aH75D2uJhBpag87vQk/6cHW0nBoR6RfMYznZNb0wCQpJ4KwHRTgkPbuQm/l13ja+G/Fx4I+a5grdLuDN2XrSJ7L1xPjmLXyxXB3FjIWRFxIFAdKsHPMKvwYiG7D2t5dfoZjjIyXgMl3IlfUaQKnOoIQrAYOYgmP9Xzj6qABEvsV1MqXSiVmN+4e6epVJ9cb9B3kErVVuRwRkCOCZKwpbn4tA4dwTu+6eAT7T53b8EUOdh0ajv8M2JCYmZsb4OBDn0b6tXe1qWLZ0z29KI8PPqb1DVUWvf0LvZNKdPd92TVH0g3WFAsUci2jED/fDPq5Na6TyP9rFWhuigVxs7xSZWx3IdM+nDPlGWZb/KeSYj9klEqQoaeLX5tBRSzplLcZSvxeirMO+Z1O2VzId9h83p44lJi3U6zxmN5rSqaK1wAN/ry6a3clWKqmWO6tl6Yn9Il95T7l6SGYZ//D+7PIt6p198vLRLghyoSbdhSTfn8+ahwhecHwMYfW0trCwXOlTS8MvAihZvUO0hDf98H4pxfZnSK3koD/YKFDXzrJkdMXMxWXb4Igklm3VOJU+mzTqeUa2C4Z8szziA/QADRwFdrBefH 7pYv5swF LvBECY/BZOtwCVKajF7A25TpKgyNsR2CTJ57wWWhOpjuO1Wr2I/T2Plv6y++HeiwZYJg9mP/VZcoZX1Sr4zs9PdYwUUORRXaaZFU7lSnZ1E4uoETg01Srjuz5VplD0zgI5OWmW+67JLaY0g2NhM7/sTmMLe4B+AfLMhyIYde8DqW4bJ5v60hS6pFuM6fDfuV+I6FRqcm4ijK5mncwbgqirev3C8TPPxX84nFEj6nMaO+uEcwMtxsOPXh7qZvvl1XaZPAFoRQ+UbXundY= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/12/6 17:25, Lorenzo Stoakes wrote: > To be clear - I'm not accepting the export of __get_unmapped_area() so if > you depend on this for this approach, you can't take this approach. > > It's an internal implementation detail. That you choose to make your > filesystem possibly a module doesn't mean that mm is required to export > internal impl details to you. Sorry. > > To rescind this would require a very strong argument, you have not provided > it. > > On Fri, Dec 06, 2024 at 11:35:08AM +0800, Jinjiang Tu wrote: >> >> 在 2024/12/5 23:04, Lorenzo Stoakes 写道: >>> + Matthew for large folio aspect >>> >>> On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote: >>>> During our tests in containers, there is a read-only file (i.e., shared >>>> libraies) in the overlayfs filesystem, and the underlying filesystem is >>>> ext4, which supports large folio. We mmap the file with PROT_READ prot, >>>> and then call madvise(MADV_COLLAPSE) for it. However, the madvise call >>>> fails and returns EINVAL. >>>> >>>> The reason is that the mapping address isn't aligned to PMD size. Since >>>> overlayfs doesn't support large folio, __get_unmapped_area() doesn't call >>>> thp_get_unmapped_area() to get a THP aligned address. >>>> >>>> To fix it, call get_unmapped_area() with the realfile. >>> Isn't the correct solution to get overlayfs to support large folios? >>> >>>> Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=m, we should >>>> export get_unmapped_area(). >>> Yeah, not in favour of this at all. This is an internal implementation >>> detail. It seems like you're trying to hack your way into avoiding >>> providing support for large folios and to hand it off to the underlying >>> file system. >>> >>> Again, why don't you just support large folios in overlayfs? >>> >>> Literally no other file system or driver appears to make use of this >>> directly in this manner. >>> >>> And there's absolutely no way this should be exported non-GPL as if it were >>> unavoidable core functionality that everyone needs. Only you seem to... >>> >>>> Signed-off-by: Jinjiang Tu >>>> --- >>>> fs/overlayfs/file.c | 20 ++++++++++++++++++++ >>>> mm/mmap.c | 1 + >>>> 2 files changed, 21 insertions(+) >>>> >>>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c >>>> index 969b458100fe..d0dcf675ebe8 100644 >>>> --- a/fs/overlayfs/file.c >>>> +++ b/fs/overlayfs/file.c >>>> @@ -653,6 +653,25 @@ static int ovl_flush(struct file *file, fl_owner_t id) >>>> return err; >>>> } >>>> >>>> +static unsigned long ovl_get_unmapped_area(struct file *file, >>>> + unsigned long addr, unsigned long len, unsigned long pgoff, >>>> + unsigned long flags) >>>> +{ >>>> + struct file *realfile; >>>> + const struct cred *old_cred; >>>> + unsigned long ret; >>>> + >>>> + realfile = ovl_real_file(file); >>>> + if (IS_ERR(realfile)) >>>> + return PTR_ERR(realfile); >>>> + >>>> + old_cred = ovl_override_creds(file_inode(file)->i_sb); >>>> + ret = get_unmapped_area(realfile, addr, len, pgoff, flags); >>>> + ovl_revert_creds(old_cred); >>> Why are you overriding credentials, then reinstating them here? That >>> seems... iffy? I knew nothing about overlayfs so this may just be a >>> misunderstanding... >> >> I refer to other file operations in overlayfs (i.e., ovl_fallocate, backing_file_mmap). >> Since get_unmapped_area() has security related operations (e.g., security_mmap_addr()), >> We should call it with the cred of the underlying file. >> >>> >>>> + >>>> + return ret; >>>> +} >>>> + >>>> const struct file_operations ovl_file_operations = { >>>> .open = ovl_open, >>>> .release = ovl_release, >>>> @@ -661,6 +680,7 @@ const struct file_operations ovl_file_operations = { >>>> .write_iter = ovl_write_iter, >>>> .fsync = ovl_fsync, >>>> .mmap = ovl_mmap, >>>> + .get_unmapped_area = ovl_get_unmapped_area, >>>> .fallocate = ovl_fallocate, >>>> .fadvise = ovl_fadvise, >>>> .flush = ovl_flush, >>>> diff --git a/mm/mmap.c b/mm/mmap.c >>>> index 16f8e8be01f8..60eb1ff7c9a8 100644 >>>> --- a/mm/mmap.c >>>> +++ b/mm/mmap.c >>>> @@ -913,6 +913,7 @@ __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, >>>> error = security_mmap_addr(addr); >>>> return error ? error : addr; >>>> } >>>> +EXPORT_SYMBOL(__get_unmapped_area); >>> We'll need a VERY good reason to export this internal implementation >>> detail, and if that were provided we'd need a VERY good reason for it not >>> to be GPL. >>> >>> This just seems to be a cheap way of invoking (), >>> maybe, if it is being used by the underlying file system. >> >> But the underlying file system may not support large folio. In this case, >> the mmap address doesn't need to be aligned with THP size. > > But it'd not cause any problems to just do that anyway right? I don't think > many people think 'oh no I have a PMD aligned mapping now what will I do'? > > Again - the right solution here is to handle large folios in overlayfs as > far as I can tell. I think this is not to handle large folio for overlayfs, it is about vma alignment or vma allocation for memory mapped files, 1) many fs support THP mapping, using thp_get_unmapped_area(), fs/bcachefs/fs.c: .get_unmapped_area = thp_get_unmapped_area, fs/btrfs/file.c: .get_unmapped_area = thp_get_unmapped_area, fs/erofs/data.c: .get_unmapped_area = thp_get_unmapped_area, fs/ext2/file.c: .get_unmapped_area = thp_get_unmapped_area, fs/ext4/file.c: .get_unmapped_area = thp_get_unmapped_area, fs/fuse/file.c: .get_unmapped_area = thp_get_unmapped_area, fs/xfs/xfs_file.c: .get_unmapped_area = thp_get_unmapped_area, 2) and some fs has own get_unmapped_area callback too, fs/cramfs/inode.c: .get_unmapped_area = cramfs_physmem_get_unmapped_area, fs/hugetlbfs/inode.c: .get_unmapped_area = hugetlb_get_unmapped_area, fs/ramfs/file-mmu.c: .get_unmapped_area = ramfs_mmu_get_unmapped_area, fs/ramfs/file-nommu.c: .get_unmapped_area = ramfs_nommu_get_unmapped_area, fs/romfs/mmap-nommu.c: .get_unmapped_area = romfs_get_unmapped_area, mm/shmem.c: .get_unmapped_area = shmem_get_unmapped_area, They has own rules to get a vma area, but with overlayfs(tries to present a filesystem which is the result over overlaying one filesystem on top of the other), we now only use the default mm_get_unmapped_area_vmflags() to get a vma area, since the overlayfs has no '.get_unmapped_area' callback. do_mmap __get_unmapped_area // get_area = NULL mm_get_unmapped_area_vmflags mmap_region mmap_file ovl_mmap It looks wrong, so we need to get the readfile via ovl_real_file() and use realfile' get_unmapped_area callback, and if the realfile is not with the callback, fallback to the default mm_get_unmapped_area(), > > In any case as per the above, we're just not exporting > __get_unmapped_area(), sorry. > So maybe use mm_get_unmapped_area() instead of __get_unmapped_area(), something like below, +static unsigned long ovl_get_unmapped_area(struct file *file, + unsigned long addr, unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + struct file *realfile; + const struct cred *old_cred; + + realfile = ovl_real_file(file); + if (IS_ERR(realfile)) + return PTR_ERR(realfile); + + if (realfile->f_op->get_unmapped_area) { + unsigned long ret; + + old_cred = ovl_override_creds(file_inode(file)->i_sb); + ret = realfile->f_op->get_unmapped_area(realfile, addr, len, + pgoff, flags); + ovl_revert_creds(old_cred); + + if (ret) + return ret; + } + + return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags); +} Correct me If I'm wrong. Thanks. >> >>> >>> And again... why not just add large folio support? We can't just take a >>> hack here. >>> >>>> unsigned long >>>> mm_get_unmapped_area(struct mm_struct *mm, struct file *file, >>>> -- >>>> 2.34.1 >>>> >