From: Jann Horn
Date: Tue, 10 Dec 2024 18:36:29 +0100
Subject: Re: [PATCH -next] ovl: respect underlying filesystem's get_unmapped_area()
To: Lorenzo Stoakes
Cc: Amir Goldstein, Jinjiang Tu, miklos@szeredi.hu,
    akpm@linux-foundation.org, vbabka@suse.cz, linux-mm@kvack.org,
    linux-unionfs@vger.kernel.org, wangkefeng.wang@huawei.com,
    sunnanyong@huawei.com, yi.zhang@huawei.com, Matthew Wilcox,
    Liam Howlett
References: <20241205143038.3260233-1-tujinjiang@huawei.com>
    <69b72e3d-b101-4641-9ce5-51346c93a98d@lucifer.local>

On Thu, Dec 5, 2024 at 4:24 PM Lorenzo Stoakes wrote:
> (fixing typo in cc list: tujinjiang@huawe.com -> tujinjiang@huawei.com)
>
> + Liam
>
> (JinJiang - you forgot to cc the correct maintainers, please ensure you run
> scripts/get_maintainers.pl on files you change)
>
> On Thu, Dec 05, 2024 at 04:12:12PM +0100, Amir Goldstein wrote:
> > On Thu, Dec 5, 2024 at 4:04 PM Lorenzo Stoakes
> > wrote:
> > >
> > > + Matthew for large folio aspect
> > >
> > > On Thu, Dec 05, 2024 at 10:30:38PM +0800, Jinjiang Tu wrote:
> > > > During our tests in containers, there is a read-only file (i.e., shared
> > > > libraies) in the overlayfs filesystem, and the underlying filesystem is
> > > > ext4, which supports large folio. We mmap the file with PROT_READ prot,
> > > > and then call madvise(MADV_COLLAPSE) for it. However, the madvise call
> > > > fails and returns EINVAL.
> > > >
> > > > The reason is that the mapping address isn't aligned to PMD size. Since
> > > > overlayfs doesn't support large folio, __get_unmapped_area() doesn't call
> > > > thp_get_unmapped_area() to get a THP aligned address.
> > > >
> > > > To fix it, call get_unmapped_area() with the realfile.
> > >
> > > Isn't the correct solution to get overlayfs to support large folios?
> > >
> > > >
> > > > Besides, since overlayfs may be built with CONFIG_OVERLAY_FS=m, we should
> > > > export get_unmapped_area().
> > >
> > > Yeah, not in favour of this at all. This is an internal implementation
> > > detail. It seems like you're trying to hack your way into avoiding
> > > providing support for large folios and to hand it off to the underlying
> > > file system.
> > >
> > > Again, why don't you just support large folios in overlayfs?
> > >
> >
> > This whole discussion seems moot.
> > overlayfs does not have address_space operations
> > It does not have its own page cache.
>
> And here we see my total lack of knowledge of overlayfs coming into play
> here :) Thanks for pointing this out.
>
> In that case, I object even further to the original of course...
>
> >
> > The file in vma->vm_file is not an overlayfs file at all - it is the
> > real (e.g. ext4) file
> > when returning from ovl_mmap() => backing_file_mmap()
> > so I have very little clue why the proposed solution even works,
> > but it certainly does not look correct.
>
> I think then Jinjiang in this cause you ought to go back to the drawing
> board and reconsider what might be the underlying issue here.

To summarize: overlayfs switches out the VMA's backing file in the
->mmap handler. ->get_unmapped_area has to be called on the original
file, before the VMA is set up (obviously), but the VMA's ->vm_file
can only be overridden once the overlayfs ->mmap handler is called. So
the ->get_unmapped_area you see early in the mmap path is provided by
overlayfs, while the VMA you have in the end is actually basically
just a VMA of the backing file that doesn't have much to do with the
original file.

So I guess some possible solutions would be:

 - overlayfs forwards the .get_unmapped_area to the backing file
   manually, or
 - the ->vm_file swapping mechanism is changed to use some new
   separate file_operations handler for "I want to use another backing
   file" that is called before the get_unmapped_area stuff?

(But to be clear, I'm not saying whether these are good ideas or not.
Maybe Lorenzo has more of an opinion on that than I do.)

By the way, I think FUSE is kinda similar: FUSE also has a
"passthrough" mode that uses backing_file_mmap(), and FUSE also doesn't
have any special code in its .get_unmapped_area handler for this. But
FUSE's .get_unmapped_area is set to thp_get_unmapped_area, which I
guess means that in passthrough mode it is sorta wrong the other way
around and unnecessarily over-aligns even if the backing file can't do
THP?
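
To make the ordering problem above concrete, here is a toy userspace
model in Python. It is only a sketch and not kernel code: File, Vma,
do_mmap() and the handler names are made up for illustration, the
"backing" field is a hypothetical stand-in for the real file behind an
overlay, and the 4 KiB page / 2 MiB PMD sizes are assumptions (x86-64
defaults). It models the address being picked through the original
file's handler before the VMA exists, and the file being swapped only
afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Optional

PAGE_SIZE = 4096             # assumption: 4 KiB base pages
PMD_SIZE = 2 * 1024 * 1024   # assumption: 2 MiB PMD, as on x86-64


def align_up(addr: int, align: int) -> int:
    """Round addr up to the next multiple of align (a power of two)."""
    return (addr + align - 1) & ~(align - 1)


@dataclass
class File:
    name: str
    get_unmapped_area: Callable[["File", int, int], int]
    mmap: Callable[["File", "Vma"], None]
    backing: Optional["File"] = None   # hypothetical link to the real file


@dataclass
class Vma:
    start: int
    length: int
    vm_file: File


def plain_area(f: File, hint: int, length: int) -> int:
    # Stand-in for generic placement: page alignment only.
    return align_up(hint, PAGE_SIZE)


def thp_area(f: File, hint: int, length: int) -> int:
    # Stand-in for a THP-aware handler: PMD-align sufficiently large mappings.
    if length >= PMD_SIZE:
        return align_up(hint, PMD_SIZE)
    return plain_area(f, hint, length)


def forwarding_area(f: File, hint: int, length: int) -> int:
    # First idea from the mail: forward the call to the backing file by hand.
    return f.backing.get_unmapped_area(f.backing, hint, length)


def plain_mmap(f: File, vma: Vma) -> None:
    # An ordinary file leaves vma.vm_file alone.
    pass


def swapping_mmap(f: File, vma: Vma) -> None:
    # The stacking filesystem replaces the VMA's file with the backing file.
    vma.vm_file = f.backing


def do_mmap(f: File, hint: int, length: int) -> Vma:
    # Step 1: the address is chosen via the ORIGINAL file's handler.
    addr = f.get_unmapped_area(f, hint, length)
    # Step 2: only now does the VMA exist, so only now can ->mmap swap vm_file.
    vma = Vma(addr, length, f)
    f.mmap(f, vma)
    return vma
```

In this model, mapping File("ovl", plain_area, swapping_mmap,
backing=ext4), where ext4 is File("ext4", thp_area, plain_mmap), with a
page-aligned but not PMD-aligned hint gives a VMA whose vm_file is ext4
but whose start is not PMD-aligned. Switching the overlay's handler to
forwarding_area gives a PMD-aligned start. The FUSE passthrough case is
the mirror image: an outer file using thp_area over a backing file that
uses plain_area ends up PMD-aligned regardless of the backing file.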