From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4A9FBE77179 for ; Fri, 6 Dec 2024 12:58:29 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9B3A56B00BB; Fri, 6 Dec 2024 07:58:28 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 9635E6B00BD; Fri, 6 Dec 2024 07:58:28 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 804456B00C0; Fri, 6 Dec 2024 07:58:28 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 5A3B96B00BB for ; Fri, 6 Dec 2024 07:58:28 -0500 (EST) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id EA0191C8832 for ; Fri, 6 Dec 2024 12:58:27 +0000 (UTC) X-FDA: 82864537098.19.6E938FB Received: from mail-ej1-f52.google.com (mail-ej1-f52.google.com [209.85.218.52]) by imf12.hostedemail.com (Postfix) with ESMTP id 9024B4001C for ; Fri, 6 Dec 2024 12:58:18 +0000 (UTC) Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=AEXB5LIf; spf=pass (imf12.hostedemail.com: domain of amir73il@gmail.com designates 209.85.218.52 as permitted sender) smtp.mailfrom=amir73il@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1733489893; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=8F7HvsLxqO/BsNvoLEm5ZBXsSUQ9vSZoPpPNvMFdc9I=; b=oUbGnuDfqoRQvtYxXL4qQhkyvqAucD7UNJ13CVpJQv0gTbnbDZj5rQy1b/QTD9Hk+cALxQ 0xAIN031Gj/EkySRbVsuOuvgLJff5lTDnfRlpgTql0475swKIg6b2+FZ6iQDcvLZDwyVmk x6AZTTYmdXop/5cb0tB1lhw5ZntZzv4= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=AEXB5LIf; spf=pass (imf12.hostedemail.com: domain of amir73il@gmail.com designates 209.85.218.52 as permitted sender) smtp.mailfrom=amir73il@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1733489893; a=rsa-sha256; cv=none; b=iYTHpJT2iDp8DFAnaDxSy65sjO4vIJqGhejevPUNNnZoGr63WH0T1duUbudE8mFh1C1t46 hxxwMLjlwiFCcJHepBMgMrrAmFNctPpwDd7di8mnmRYm++YPBfiHcEQ9M8FFN68eI2FfAe q1Gt9vXiPO6jCxb2jGROkxMzYfasuis= Received: by mail-ej1-f52.google.com with SMTP id a640c23a62f3a-aa62f1d2b21so211848066b.1 for ; Fri, 06 Dec 2024 04:58:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1733489904; x=1734094704; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=8F7HvsLxqO/BsNvoLEm5ZBXsSUQ9vSZoPpPNvMFdc9I=; b=AEXB5LIfdqn6QlKizthiD+9YJNwM3ZACXyneStiDYuN2tnh/36akMHANxbzLByKcqw IXK+lWYGCtTvPv4VEjwhvTXWijQOtrf9deY0kuNoffu9gCAWElou0m5bt6w4MwSBmpfY Wf5M9Oi/fQmrLZWkHULPiJg9F+KBIr0z5opKdUgrTWw+OCegNFe6awtQ65xWD93hAKhy 3K6cnyk5l2sRjK5Rjax1mkSnA7EYjULJZLfnj5otEyeCkQbUaiX9G1fl5L2Eo68qmg7b CuNkyJt4uw+ODVGXBOqPMncxCUGoXpjYd/HwOBGPqaYwlbfeAlZnikLIl1TlDBEXfFv0 LNFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1733489904; x=1734094704; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=8F7HvsLxqO/BsNvoLEm5ZBXsSUQ9vSZoPpPNvMFdc9I=; b=nFIGPJsbPI+KDvEIh42rJM3COhqbZBMWmKt1ShhDghlNLJULB+i0unE527w1PWXDWz GpaTR2YCebu8qqok7UnLVPxReOb5EzGmFzjfzrD0kwxQZk/j1eCEp7NIAuUvl0dcErHL TRyTcw9vcf1+F5TIK0T7wfqYdk4I4HtTR6ikyL+kLX40CHssq0S3w3FpSxnQiy/m90vm aSlt0VCzFZe/Wp5DpmzYh+yctxxJZLa+PERei+zKrdnSZYo5vDFDRwbww564cNtwgr5+ DEFA+whGGSFMuNNOkpVHgjak4W3zWT0aNem71aFZZhe+bPQV9r8nXsq5pHEbA/oUkalN M2Qw== X-Forwarded-Encrypted: i=1; AJvYcCU5w/O9tQ77SQbhdSioPXI5kidAL/lAEjOLadjU+rYae8p9enDKfQB9qI4XDIWgCzYA9JylQuW3JQ==@kvack.org X-Gm-Message-State: AOJu0YwhwjxeG/0vxG/nPRDdc5kVFeVXzrYYXiXJwZ94tdCMfGJ8l4jA bnYmC4B4ERxZWMq8OqVT6sfgflMY5RLPNfAr5PZ3xim4wTXbHx0YtUspi62x4gmndTzSsJ0DU+7 QzEghnqIwQQApGtaNh7go+9aqQQY= X-Gm-Gg: ASbGncukbodDlBNTOprXevJsu9FtdvZ/snLhw+Xb6IxT5z15eZbLI7l5jLwekMPx9In ZvV0ffAG5/rJ+8F9GfL0M+NAp1aq4IRQ= X-Google-Smtp-Source: AGHT+IF3tHBfl5kltrr2azKq6xvF6nb17womuIoRJUQHPcjAm6Az54Ihchm8UFkzk5wc5B0MCmF7nXHn5vohnmEgXY0= X-Received: by 2002:a17:906:32d0:b0:aa6:1c4b:9c5b with SMTP id a640c23a62f3a-aa639fa5cfamr165384766b.7.1733489903865; Fri, 06 Dec 2024 04:58:23 -0800 (PST) MIME-Version: 1.0 References: <20241205143038.3260233-1-tujinjiang@huawei.com> <69b72e3d-b101-4641-9ce5-51346c93a98d@lucifer.local> <041dcc1a-0630-27b9-661b-8c64a3775426@huawei.com> In-Reply-To: From: Amir Goldstein Date: Fri, 6 Dec 2024 13:58:12 +0100 Message-ID: Subject: Re: [PATCH -next] ovl: respect underlying filesystem's get_unmapped_area() To: Kefeng Wang Cc: Lorenzo Stoakes , Jinjiang Tu , miklos@szeredi.hu, akpm@linux-foundation.org, vbabka@suse.cz, jannh@google.com, linux-mm@kvack.org, linux-unionfs@vger.kernel.org, sunnanyong@huawei.com, yi.zhang@huawei.com, Matthew Wilcox Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 9024B4001C X-Rspamd-Server: rspam12 X-Stat-Signature: aeicjzpdr4w1kajcs7zzrf89tnesp4zd X-Rspam-User: X-HE-Tag: 1733489898-427281 X-HE-Meta: U2FsdGVkX19wfAOJVJoR18gbKc51VD9sBw7Mo63Aoz8in/faxSiHQZnbBWwk3QO5poOFa4DuWX1B390L7GvZDPIHyITJTZprwXm8DiiUpoa/Omwti9XcQJBKsDOqzLmEnEC/l9lNXKbbkvfEDP20InAxeWXrajYryNdbZiT7Sk7ZCug2Ema7Cq0sRkcIGjRIOi2nG5BxBTggMh3pPkCNgBeTdg1DvEDN6LWbwgQ93WbApww/jSevrpcgmunOyrpYp6o7fC7phGRSh1c/bu1X1DUsQbruAQSY4PUhgt1D6hAMqlTQQJLpwxGDZN/8Jyeyfk8IA6s/oxgaz4yGU2LYLHcmpQLyMtTdDsSapE1eBZ29P2rUx8i/jbGkXxJWgu05uLrgN35kjKCDggQCahTebSUwnEx0DnK4wUJskY1goj//jroWU1F21EIbi+hW5O+uUx+tPD+mYkBrRQPamPxUxjMJlSKQ9E08P3/sGAI98hGDh4jt1txCkISZFQXpBrgpMifSrIVX3qu23sMpHKdQQaQRphFzhq4wHbtat5GHFQckqvb5kGMQT1zHWe9RSV91JpNNWzeSmrjqFKVf2ea7OCUZMAEDHketTFWv/ZTWSyM0DMT8R78A1wegRAZqENsUdyVxxA8vIKe09g0oftni02AVjfML4So+bXYC+NYywozLQaJT7r+0iWCHt9Fsmr1h9xTOgGCrIftwkV2egeSInLXimuDD/VxT6+UvFvpIoGmLiHuch3+W52/+m0Pret55U1PZH/vrU1Qkx/qLeshbqrDib5aY09m07BrUOW+BHk1cnS+ZBvIJIVl0wE/E8pnIwgS6oBuOU4ugnLRg0EhoiEAdOUvVoTngDPTWVLkBI5QRk56QZ4nZmsMUUQyR9VQk4Zdm882WvJVW3U8eTb/w16pO6wYEL+5Lp1JfDBWpbk7rlKtVcmKaVAXqqqvjySOmdDTOItZ1Xk6w1yutkST elLPrt4I Sh81OLA+kLIdpTTdSrrmKiLvFeggpuVSHXkvkPpsQMSXVGgZm/IDNHdw30sI4npIwLKz7rLv4L7/xr67LrreEeAihwx12ZGgEnGZIqaXiB/Qi4YQB985IcA8nrmczIIGkn5/078ciQ325VrAcZarn94nNc7olqCXeY+fNfySPm4lCCNXXbZPGpv3ynmXUEQYldHgDLDYULJzrOSyph6SuAb8TfE/2MMC1cZknukHGlrJyB4phU05YlCBVz56bwHvfdOu1icAeObD449C+IzNHRDBaNbvrWOjoICXWeYTpNgJSTIGm0Z5tFJ4Mzjd5G7wv067GrpaYTRsoIiC37D1kD6Mqp/0VoI6tvy8RVfw2mSkctazYNwxnrcM7KaKgq+HZeR9Bd+UTH8KB95Y= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Dec 6, 2024 at 11:45=E2=80=AFAM Kefeng Wang wrote: > > > > On 2024/12/6 17:25, Lorenzo Stoakes wrote: > > To be clear - I'm not accepting the export of __get_unmapped_area() so = if > > you depend on this for this approach, you can't take this approach. > > > > It's an internal implementation detail. That you choose to make your > > filesystem possibly a module doesn't mean that mm is required to export > > internal impl details to you. Sorry. > > > > To rescind this would require a very strong argument, you have not prov= ided > > it. > > > > On Fri, Dec 06, 2024 at 11:35:08AM +0800, Jinjiang Tu wrote: > >> > >> =E5=9C=A8 2024/12/5 23:04, Lorenzo Stoakes =E5=86=99=E9=81=93: > >>> + Matthew for large folio aspect > >>> > >>> On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote: > >>>> During our tests in containers, there is a read-only file (i.e., sha= red > >>>> libraies) in the overlayfs filesystem, and the underlying filesystem= is > >>>> ext4, which supports large folio. We mmap the file with PROT_READ pr= ot, > >>>> and then call madvise(MADV_COLLAPSE) for it. However, the madvise ca= ll > >>>> fails and returns EINVAL. > >>>> > >>>> The reason is that the mapping address isn't aligned to PMD size. Si= nce > >>>> overlayfs doesn't support large folio, __get_unmapped_area() doesn't= call > >>>> thp_get_unmapped_area() to get a THP aligned address. > >>>> > >>>> To fix it, call get_unmapped_area() with the realfile. > >>> Isn't the correct solution to get overlayfs to support large folios? > >>> > >>>> Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=3Dm, we= should > >>>> export get_unmapped_area(). > >>> Yeah, not in favour of this at all. This is an internal implementatio= n > >>> detail. It seems like you're trying to hack your way into avoiding > >>> providing support for large folios and to hand it off to the underlyi= ng > >>> file system. > >>> > >>> Again, why don't you just support large folios in overlayfs? > >>> > >>> Literally no other file system or driver appears to make use of this > >>> directly in this manner. > >>> > >>> And there's absolutely no way this should be exported non-GPL as if i= t were > >>> unavoidable core functionality that everyone needs. Only you seem to.= .. > >>> > >>>> Signed-off-by: Jinjiang Tu > >>>> --- > >>>> fs/overlayfs/file.c | 20 ++++++++++++++++++++ > >>>> mm/mmap.c | 1 + > >>>> 2 files changed, 21 insertions(+) > >>>> > >>>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > >>>> index 969b458100fe..d0dcf675ebe8 100644 > >>>> --- a/fs/overlayfs/file.c > >>>> +++ b/fs/overlayfs/file.c > >>>> @@ -653,6 +653,25 @@ static int ovl_flush(struct file *file, fl_owne= r_t id) > >>>> return err; > >>>> } > >>>> > >>>> +static unsigned long ovl_get_unmapped_area(struct file *file, > >>>> + unsigned long addr, unsigned long len, unsigned long pgof= f, > >>>> + unsigned long flags) > >>>> +{ > >>>> + struct file *realfile; > >>>> + const struct cred *old_cred; > >>>> + unsigned long ret; > >>>> + > >>>> + realfile =3D ovl_real_file(file); > >>>> + if (IS_ERR(realfile)) > >>>> + return PTR_ERR(realfile); > >>>> + > >>>> + old_cred =3D ovl_override_creds(file_inode(file)->i_sb); > >>>> + ret =3D get_unmapped_area(realfile, addr, len, pgoff, flags); > >>>> + ovl_revert_creds(old_cred); > >>> Why are you overriding credentials, then reinstating them here? That > >>> seems... iffy? I knew nothing about overlayfs so this may just be a > >>> misunderstanding... > >> > >> I refer to other file operations in overlayfs (i.e., ovl_fallocate, ba= cking_file_mmap). > >> Since get_unmapped_area() has security related operations (e.g., secur= ity_mmap_addr()), > >> We should call it with the cred of the underlying file. > >> > >>> > >>>> + > >>>> + return ret; > >>>> +} > >>>> + > >>>> const struct file_operations ovl_file_operations =3D { > >>>> .open =3D ovl_open, > >>>> .release =3D ovl_release, > >>>> @@ -661,6 +680,7 @@ const struct file_operations ovl_file_operations= =3D { > >>>> .write_iter =3D ovl_write_iter, > >>>> .fsync =3D ovl_fsync, > >>>> .mmap =3D ovl_mmap, > >>>> + .get_unmapped_area =3D ovl_get_unmapped_area, > >>>> .fallocate =3D ovl_fallocate, > >>>> .fadvise =3D ovl_fadvise, > >>>> .flush =3D ovl_flush, > >>>> diff --git a/mm/mmap.c b/mm/mmap.c > >>>> index 16f8e8be01f8..60eb1ff7c9a8 100644 > >>>> --- a/mm/mmap.c > >>>> +++ b/mm/mmap.c > >>>> @@ -913,6 +913,7 @@ __get_unmapped_area(struct file *file, unsigned = long addr, unsigned long len, > >>>> error =3D security_mmap_addr(addr); > >>>> return error ? error : addr; > >>>> } > >>>> +EXPORT_SYMBOL(__get_unmapped_area); > >>> We'll need a VERY good reason to export this internal implementation > >>> detail, and if that were provided we'd need a VERY good reason for it= not > >>> to be GPL. > >>> > >>> This just seems to be a cheap way of invoking (), > >>> maybe, if it is being used by the underlying file system. > >> > >> But the underlying file system may not support large folio. In this ca= se, > >> the mmap address doesn't need to be aligned with THP size. > > > > But it'd not cause any problems to just do that anyway right? I don't t= hink > > many people think 'oh no I have a PMD aligned mapping now what will I d= o'? > > > > Again - the right solution here is to handle large folios in overlayfs = as > > far as I can tell. > > I think this is not to handle large folio for overlayfs, it is about vma > alignment or vma allocation for memory mapped files, > > > 1) many fs support THP mapping, using thp_get_unmapped_area(), > > fs/bcachefs/fs.c: .get_unmapped_area =3D thp_get_unmapped_area, > fs/btrfs/file.c: .get_unmapped_area =3D thp_get_unmapped_area, > fs/erofs/data.c: .get_unmapped_area =3D thp_get_unmapped_area, > fs/ext2/file.c: .get_unmapped_area =3D thp_get_unmapped_area, > fs/ext4/file.c: .get_unmapped_area =3D thp_get_unmapped_area, > fs/fuse/file.c: .get_unmapped_area =3D thp_get_unmapped_area, > fs/xfs/xfs_file.c: .get_unmapped_area =3D thp_get_unmapped_area, > > 2) and some fs has own get_unmapped_area callback too, > > fs/cramfs/inode.c: .get_unmapped_area =3D cramfs_physmem_get_un= mapped_area, > fs/hugetlbfs/inode.c: .get_unmapped_area =3D hugetlb_get_unmapped_= area, > fs/ramfs/file-mmu.c: .get_unmapped_area =3D ramfs_mmu_get_unmappe= d_area, > fs/ramfs/file-nommu.c: .get_unmapped_area =3D ramfs_nommu_get_unmap= ped_area, > fs/romfs/mmap-nommu.c: .get_unmapped_area =3D romfs_get_unmapped_ar= ea, > mm/shmem.c: .get_unmapped_area =3D shmem_get_unmapped_area, > > They has own rules to get a vma area, but with overlayfs(tries to > present a filesystem which is the result over overlaying one filesystem > on top of the other), we now only use the default > mm_get_unmapped_area_vmflags() to get a vma area, since the overlayfs > has no '.get_unmapped_area' callback. > > do_mmap > __get_unmapped_area > // get_area =3D NULL > mm_get_unmapped_area_vmflags > mmap_region > mmap_file > ovl_mmap > > It looks wrong, so we need to get the readfile via ovl_real_file() > and use realfile' get_unmapped_area callback, and if the realfile > is not with the callback, fallback to the default > mm_get_unmapped_area(), > > > > > In any case as per the above, we're just not exporting > > __get_unmapped_area(), sorry. > > > > So maybe use mm_get_unmapped_area() instead of __get_unmapped_area(), > something like below, > > +static unsigned long ovl_get_unmapped_area(struct file *file, > + unsigned long addr, unsigned long len, unsigned long pgof= f, > + unsigned long flags) > +{ > + struct file *realfile; > + const struct cred *old_cred; > + > + realfile =3D ovl_real_file(file); > + if (IS_ERR(realfile)) > + return PTR_ERR(realfile); > + > + if (realfile->f_op->get_unmapped_area) { > + unsigned long ret; > + > + old_cred =3D ovl_override_creds(file_inode(file)->i_sb); > + ret =3D realfile->f_op->get_unmapped_area(realfile, addr,= len, > + pgoff, flags); > + ovl_revert_creds(old_cred); > + > + if (ret) > + return ret; > + } > + > + return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, > flags); > +} > > Correct me If I'm wrong. > You just need to be aware of the fact that between ovl_get_unmapped_area() and ovl_mmap(), ovl_real_file(file) could change from the lower file, to th= e upper file due to another operation that initiated copy-up. Thanks, Amir.