From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 22 Oct 2024 11:41:08 +0800
MIME-Version: 1.0
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
To: Daniel Gomez, "Kirill A. Shutemov"
Cc: Matthew Wilcox, akpm@linux-foundation.org, hughd@google.com,
 david@redhat.com, wangkefeng.wang@huawei.com, 21cnbao@gmail.com,
 ryan.roberts@arm.com, ioworker0@gmail.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, "Kirill A . Shutemov"
References: <6dohx7zna7x6hxzo4cwnwarep3a7rohx4qxubds3uujfb7gp3c@2xaubczl2n6d>
 <8e48cf24-83e1-486e-b89c-41edb7eeff3e@linux.alibaba.com>
From: Baolin Wang
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2024/10/21 21:34, Daniel Gomez wrote:
> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A.
> Shutemov wrote:
>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
>>>>> + Kirill
>>>>>
>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
>>>>>>> Considering that tmpfs already has the 'huge=' option to control the THP
>>>>>>> allocation, it is necessary to maintain compatibility with the 'huge='
>>>>>>> option, as well as considering the 'deny' and 'force' option controlled
>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
>>>>>>
>>>>>> No, it's not. No other filesystem honours these settings. tmpfs would
>>>>>> not have had these settings if it were written today. It should simply
>>>>>> ignore them, the way that NFS ignores the "intr" mount option now that
>>>>>> we have a better solution to the original problem.
>>>>>>
>>>>>> To reiterate my position:
>>>>>>
>>>>>> - When using tmpfs as a filesystem, it should behave like other
>>>>>>   filesystems.
>>>>>> - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it should
>>>>>>   behave like anonymous memory.
>>>>>
>>>>> I do agree with your point to some extent, but the ‘huge=’ option has
>>>>> existed for nearly 8 years, and the huge orders based on write size may not
>>>>> achieve the performance of PMD-sized THP in some scenarios, such as when the
>>>>> write length is consistently 4K. So, I am still concerned that ignoring the
>>>>> 'huge' option could lead to compatibility issues.
>>>>
>>>> Yeah, I don't think we are there yet to ignore the mount option.
>>>
>>> OK.
>>>
>>>> Maybe we need to get a new generic interface to request the semantics
>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of FADV_*
>>>> handles to make kernel allocate PMD-size folio on any allocation or on
>>>> allocations within i_size. I think this behaviour is useful beyond tmpfs.
>>>>
>>>> Then huge= implementation for tmpfs can be re-defined to set these
>>>> per-inode FADV_ flags by default. This way we can keep tmpfs compatible
>>>> with current deployments and less special comparing to rest of
>>>> filesystems on kernel side.
>>>
>>> I did a quick search, and I didn't find any other fs that require PMD-sized
>>> huge pages, so I am not sure if FADV_* is useful for filesystems other than
>>> tmpfs. Please correct me if I missed something.
>>
>> What do you mean by "require"? THPs are always opportunistic.
>>
>> IIUC, we don't have a way to hint kernel to use huge pages for a file on
>> read from backing storage. Readahead is not always the right way.
>>
>>>> If huge= is not set, tmpfs would behave the same way as the rest of
>>>> filesystems.
>>>
>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate large
>>> folios based on the write size? If yes, that means it will change the
>>> default huge behavior for tmpfs. Because previously having 'huge=' is not
>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar to what I
>>> mentioned:
>>> "Another possible choice is to make the huge pages allocation based on write
>>> size as the *default* behavior for tmpfs, ..."
>>
>> I am more worried about breaking existing users of huge pages. So changing
>> behaviour of users who don't specify huge is okay to me.
>
> I think moving tmpfs to allocate large folios opportunistically by
> default (as it was proposed initially) doesn't necessary conflict with
> the default behaviour (huge=never). We just need to clarify that in
> the documentation.
>
> However, and IIRC, one of the requests from Hugh was to have a way to
> disable large folios which is something other FS do not have control
> of as of today. Ryan sent a proposal to actually control that globally
> but I think it didn't move forward.
> So, what are we missing to go back
> to implement large folios in tmpfs in the default case, as any other fs
> leveraging large folios?

IMHO, as I discussed with Kirill, we still need to maintain compatibility
with the 'huge=' mount option. This means that if 'huge=never' is set for
tmpfs, huge page allocation will still be prohibited (which can address
Hugh's request?). However, if 'huge=' is not set, we can allocate large
folios based on the write size.
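
To make the proposed policy concrete, here is a rough userspace sketch of the decision, not kernel code. The helper name pick_folio_order(), the 4K base page size, the PMD order of 9 and the "largest power-of-two number of pages that fits in the write" rule are all assumptions for illustration; the real series may also take the file offset alignment into account. 'huge=always' is modelled as PMD-sized allocation to reflect today's behaviour, while 'within_size' and 'advise' are left out because they depend on i_size and madvise().

```python
PAGE_SIZE = 4096   # assumed base page size (4K)
PMD_ORDER = 9      # assumed PMD order (2M with 4K pages)


def pick_folio_order(huge_opt, write_len):
    """Hypothetical helper: return the folio order to try for a tmpfs write.

    huge_opt is the value of the 'huge=' mount option, or None when the
    option was not given at mount time.
    """
    if huge_opt == "never":
        # Explicit 'huge=never': huge page allocation stays prohibited.
        return 0
    if huge_opt == "always":
        # Existing semantics: try a PMD-sized folio regardless of write size.
        return PMD_ORDER
    if huge_opt is None:
        # Option not set: choose the order from the write size, capped at PMD.
        pages = write_len // PAGE_SIZE
        if pages < 2:
            return 0
        return min(pages.bit_length() - 1, PMD_ORDER)
    # 'within_size' and 'advise' are intentionally not modelled here.
    raise ValueError("huge=%s is not modelled in this sketch" % huge_opt)
```

With these assumptions a 4K write on a mount without 'huge=' still gets an order-0 folio, a 64K write gets order 4, and anything of 2M or more is capped at the PMD order, while 'huge=never' always yields order 0.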