Date: Fri, 25 Oct 2024 10:56:29 +0800
Message-ID: <645ec5ee-ad60-4114-85fb-d19b5791d8a9@linux.alibaba.com>
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Daniel Gomez, David Hildenbrand, Kirill A. Shutemov
Cc: Matthew Wilcox, akpm@linux-foundation.org, hughd@google.com,
 wangkefeng.wang@huawei.com, 21cnbao@gmail.com, ryan.roberts@arm.com,
 ioworker0@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org

On 2024/10/24 18:49, Daniel Gomez wrote:
> On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
>> On 23.10.24 10:04, Baolin Wang wrote:
>>> On 2024/10/22 23:31, David Hildenbrand wrote:
>>>> On 22.10.24 05:41, Baolin Wang wrote:
>>>>> On 2024/10/21 21:34, Daniel Gomez wrote:
>>>>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
>>>>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>>>>>>> + Kirill
>>>>>>>>>>
>>>>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
>>>>>>>>>>>> control THP allocation, it is necessary to maintain
>>>>>>>>>>>> compatibility with the 'huge=' option, as well as considering
>>>>>>>>>>>> the 'deny' and 'force' options controlled by
>>>>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>>>>>>
>>>>>>>>>>> No, it's not. No other filesystem honours these settings. tmpfs
>>>>>>>>>>> would not have had these settings if it were written today. It
>>>>>>>>>>> should simply ignore them, the way that NFS ignores the "intr"
>>>>>>>>>>> mount option now that we have a better solution to the original
>>>>>>>>>>> problem.
>>>>>>>>>>>
>>>>>>>>>>> To reiterate my position:
>>>>>>>>>>>
>>>>>>>>>>>    - When using tmpfs as a filesystem, it should behave like
>>>>>>>>>>>      other filesystems.
>>>>>>>>>>>    - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
>>>>>>>>>>>      it should behave like anonymous memory.
>>>>>>>>>>
>>>>>>>>>> I do agree with your point to some extent, but the 'huge=' option
>>>>>>>>>> has existed for nearly 8 years, and the huge orders based on
>>>>>>>>>> write size may not achieve the performance of PMD-sized THP in
>>>>>>>>>> some scenarios, such as when the write length is consistently 4K.
>>>>>>>>>> So, I am still concerned that ignoring the 'huge=' option could
>>>>>>>>>> lead to compatibility issues.
>>>>>>>>>
>>>>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>>>>>>
>>>>>>>> OK.
>>>>>>>>
>>>>>>>>> Maybe we need a new generic interface to request the semantics
>>>>>>>>> tmpfs has with 'huge=' at a per-inode level on any fs, like a set
>>>>>>>>> of FADV_* handles that make the kernel allocate PMD-sized folios
>>>>>>>>> on any allocation, or on allocations within i_size. I think this
>>>>>>>>> behaviour is useful beyond tmpfs.
>>>>>>>>>
>>>>>>>>> Then the 'huge=' implementation for tmpfs can be re-defined to set
>>>>>>>>> these per-inode FADV_ flags by default. This way we can keep tmpfs
>>>>>>>>> compatible with current deployments and less special compared to
>>>>>>>>> the rest of the filesystems on the kernel side.
>>>>>>>>
>>>>>>>> I did a quick search, and I didn't find any other fs that requires
>>>>>>>> PMD-sized huge pages, so I am not sure if FADV_* is useful for
>>>>>>>> filesystems other than tmpfs. Please correct me if I missed
>>>>>>>> something.
>>>>>>>
>>>>>>> What do you mean by "require"? THPs are always opportunistic.
>>>>>>>
>>>>>>> IIUC, we don't have a way to hint the kernel to use huge pages for a
>>>>>>> file on read from backing storage. Readahead is not always the right
>>>>>>> way.
>>>>>>>
>>>>>>>>> If 'huge=' is not set, tmpfs would behave the same way as the rest
>>>>>>>>> of the filesystems.
>>>>>>>>
>>>>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
>>>>>>>> allocate large folios based on the write size? If yes, that means
>>>>>>>> it will change the default huge behavior for tmpfs, because
>>>>>>>> previously, 'huge=' not being set meant the huge option was
>>>>>>>> 'SHMEM_HUGE_NEVER'. This is similar to what I mentioned:
>>>>>>>> "Another possible choice is to make the huge pages allocation based
>>>>>>>> on write size as the *default* behavior for tmpfs, ..."
>>>>>>>
>>>>>>> I am more worried about breaking existing users of huge pages. So
>>>>>>> changing the behaviour of users who don't specify 'huge=' is okay
>>>>>>> to me.
>>>>>>
>>>>>> I think moving tmpfs to allocate large folios opportunistically by
>>>>>> default (as it was proposed initially) doesn't necessarily conflict
>>>>>> with the default behaviour (huge=never). We just need to clarify
>>>>>> that in the documentation.
>>>>>>
>>>>>> However, and IIRC, one of the requests from Hugh was to have a way
>>>>>> to disable large folios, which is something other filesystems do not
>>>>>> have control over as of today. Ryan sent a proposal to actually
>>>>>> control that globally, but I think it didn't move forward. So, what
>>>>>> are we missing to go back to implementing large folios in tmpfs in
>>>>>> the default case, as any other fs leveraging large folios?
>>>>>
>>>>> IMHO, as I discussed with Kirill, we still need to maintain
>>>>> compatibility with the 'huge=' mount option. This means that if
>>>>> 'huge=never' is set for tmpfs, huge page allocation will still be
>>>>> prohibited (which may address Hugh's request?). However, if 'huge='
>>>>> is not set, we can allocate large folios based on the write size.

> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default.

Right.

> According to the documentation, not setting 'huge=' is the same as
> setting 'huge=never'.

I will update the documentation in the next version. That means that if
the 'huge=' option is not set, we can still allocate large folios based
on the write size (which will not be the same as setting 'huge=never').

> However, 'huge=' is intended to control THP, not large folios, so there
> shouldn't be a conflict in this case.

Yes, we should still keep the same semantics for the
'huge=always/within_size/advise' settings, which only control THP
allocations.

> Can you clarify what specific scenario or conflict you're considering
> here? Perhaps when the large folio order is the same as PMD size?
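
For reference, the 'huge=' policy debated above is a per-mount tmpfs
option, while 'deny' and 'force' exist only on the system-wide sysfs knob.
A minimal sketch of setting the mount option from C via the standard
mount(2) call (the /mnt/tmp target path is only an example):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/*
	 * Mount a tmpfs instance with an explicit 'huge=' policy. Valid
	 * values are never, always, within_size and advise; 'deny' and
	 * 'force' apply only to the global knob at
	 * /sys/kernel/mm/transparent_hugepage/shmem_enabled.
	 */
	if (mount("tmpfs", "/mnt/tmp", "tmpfs", 0, "huge=within_size")) {
		perror("mount");
		return 1;
	}
	return 0;
}

The equivalent shell form is
'mount -t tmpfs -o huge=within_size tmpfs /mnt/tmp'.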
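
Kirill's FADV_* suggestion is not an existing interface; no such advice
value is defined today. Purely as an illustration of the shape such a
per-inode hint could take, the sketch below invents an
FADV_HUGEPAGE_HYPOTHETICAL constant (posix_fadvise(2) itself is real and
would currently reject the unknown value with EINVAL):

#include <fcntl.h>
#include <stdio.h>

/*
 * HYPOTHETICAL value: sketches a per-inode hint asking the kernel to
 * prefer PMD-sized folios for this file, on any filesystem.
 */
#define FADV_HUGEPAGE_HYPOTHETICAL 100

int main(void)
{
	int fd = open("/mnt/tmp/file", O_RDWR | O_CREAT, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Real syscall, invented advice value: fails with EINVAL today. */
	if (posix_fadvise(fd, 0, 0, FADV_HUGEPAGE_HYPOTHETICAL))
		fprintf(stderr, "hint rejected (expected on current kernels)\n");
	return 0;
}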
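
On "allocate large folios based on the write size": the idea is that the
folio order tracks the length of the write, capped at PMD order. A rough
userspace model of that mapping (the function name and policy details are
illustrative, not the patch series' actual code; assumes 4K base pages):

#include <stdio.h>

#define PAGE_SHIFT 12	/* 4K base pages */
#define PMD_ORDER 9	/* 2M PMD-sized folio = 512 base pages */

/*
 * Map a write length to the largest folio order that fits within it,
 * never exceeding PMD order. A stream of consistently 4K writes yields
 * order 0, which is the performance concern raised above.
 */
static unsigned int write_size_to_order(size_t len)
{
	unsigned int order = 0;
	size_t pages = len >> PAGE_SHIFT;

	while (order < PMD_ORDER && (1UL << (order + 1)) <= pages)
		order++;
	return order;
}

int main(void)
{
	printf("4K write  -> order %u\n", write_size_to_order(4096));    /* 0 */
	printf("64K write -> order %u\n", write_size_to_order(65536));   /* 4 */
	printf("4M write  -> order %u\n", write_size_to_order(4 << 20)); /* 9 */
	return 0;
}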