Message-ID: <8172f4fb-17ce-4df9-a8cf-f2bed0910370@linux.alibaba.com>
Date: Tue, 5 Nov 2024 20:45:01 +0800
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
To: David Hildenbrand, Daniel Gomez, Daniel Gomez, "Kirill A. Shutemov"
Cc: Matthew Wilcox, akpm@linux-foundation.org, hughd@google.com,
 wangkefeng.wang@huawei.com, 21cnbao@gmail.com, ryan.roberts@arm.com,
 ioworker0@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 "Kirill A. Shutemov"
References: <6dohx7zna7x6hxzo4cwnwarep3a7rohx4qxubds3uujfb7gp3c@2xaubczl2n6d>
 <8e48cf24-83e1-486e-b89c-41edb7eeff3e@linux.alibaba.com>
 <486a72c6-5877-4a95-a587-2a32faa8785d@redhat.com>
 <7eb412d1-f90e-4363-8c7b-072f1124f8a6@linux.alibaba.com>
 <1b0f9f94-06a6-48ac-a68e-848bce1008e9@redhat.com>
 <7ca333ba-f9bc-4f78-8f5b-1035ca91c2d5@redhat.com>
 <0b7671fd-3fea-4086-8a85-fe063a62fa80@linux.alibaba.com>
 <2782890e-09dc-46bd-ab86-1f8974c7eb7a@linux.alibaba.com>
 <99a3cc07-bdc3-48e2-ab5c-6f4de1bd2e7b@redhat.com>
From: Baolin Wang <baolin.wang@linux.alibaba.com>
In-Reply-To: <99a3cc07-bdc3-48e2-ab5c-6f4de1bd2e7b@redhat.com>

On 2024/10/31 18:46, David Hildenbrand wrote:

[snip]

>>> I don't like that:
>>>
>>> (a) there is no way to explicitly enable/name that new behavior.
>>
>> But this is similar to other file systems that enable large folios
>> (by setting mapping_set_large_folios()), and I haven't seen any other
>> file system that supports large folios require a new Kconfig option.
>> Maybe tmpfs is a bit special?
>
> I'm afraid I don't have the energy to explain once more why I think
> tmpfs is not just like any other file system in some cases.
>
> And distributions are rather careful when it comes to something like
> this ...
>
>> If we all agree that tmpfs is a bit special when using huge pages,
>> then fine, a Kconfig option might be needed.
>>
>>> (b) "always" etc. are only concerned about PMDs.
>>
>> Yes, it currently maintains the same semantics as before, in case
>> users still expect THPs.
>
> Again, I don't think that is a reasonable approach to make PMD-sized
> ones special here. It will all get seriously confusing and inconsistent.

I agree that PMD-sized folios should not be special. This is all for
backward compatibility with the 'huge=' mount option, and adding a new
Kconfig option is also for this purpose.

> THPs are opportunistic after all, and page fault behavior will remain
> unchanged (PMD-sized) for now. And even if we support other sizes
> during page faults, we'd likely start with the largest size (PMD-size)
> first, and it might just all work better than before.
>
> Happy to learn where this really makes a difference.
>
> Of course, if you change the default behavior (which you are planning),
> it's ... a changed default.
>
> If there are reasons to have more tunables regarding the sizes to use,
> then it should not be limited to PMD-size.

I have tried to modify the code according to your suggestion (not tested
yet). Is this what you had in mind?

static inline unsigned int
shmem_mapping_size_order(struct address_space *mapping, pgoff_t index,
			 loff_t write_end)
{
	unsigned int order;
	size_t size;

	if (!mapping_large_folio_support(mapping) || !write_end)
		return 0;

	/* Calculate the write size based on the write_end */
	size = write_end - (index << PAGE_SHIFT);
	order = filemap_get_order(size);
	if (!order)
		return 0;

	/* If we're not aligned, allocate a smaller folio */
	if (index & ((1UL << order) - 1))
		order = __ffs(index);

	order = min_t(size_t, order, MAX_PAGECACHE_ORDER);

	/* Return a mask with all orders from 0 up to 'order' set */
	return order > 0 ? BIT(order + 1) - 1 : 0;
}

static unsigned int shmem_huge_global_enabled(struct inode *inode,
					      pgoff_t index, loff_t write_end,
					      bool shmem_huge_force,
					      unsigned long vm_flags)
{
	bool is_shmem = inode->i_sb == shm_mnt->mnt_sb;
	unsigned long within_size_orders;
	unsigned int order;
	loff_t i_size;

	if (HPAGE_PMD_ORDER > MAX_PAGECACHE_ORDER)
		return 0;
	if (!S_ISREG(inode->i_mode))
		return 0;
	if (shmem_huge == SHMEM_HUGE_DENY)
		return 0;
	if (shmem_huge_force || shmem_huge == SHMEM_HUGE_FORCE)
		return BIT(HPAGE_PMD_ORDER);

	switch (SHMEM_SB(inode->i_sb)->huge) {
	case SHMEM_HUGE_NEVER:
		return 0;
	case SHMEM_HUGE_ALWAYS:
		if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
			return BIT(HPAGE_PMD_ORDER);

		return shmem_mapping_size_order(inode->i_mapping, index,
						write_end);
	case SHMEM_HUGE_WITHIN_SIZE:
		if (is_shmem || IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
			within_size_orders = BIT(HPAGE_PMD_ORDER);
		else
			within_size_orders =
				shmem_mapping_size_order(inode->i_mapping,
							 index, write_end);

		order = highest_order(within_size_orders);
		while (within_size_orders) {
			index = round_up(index + 1, 1 << order);
			i_size = max(write_end, i_size_read(inode));
			i_size = round_up(i_size, PAGE_SIZE);
			if (i_size >> PAGE_SHIFT >= index)
				return within_size_orders;

			order = next_order(&within_size_orders, order);
		}
		fallthrough;
	case SHMEM_HUGE_ADVISE:
		if (vm_flags & VM_HUGEPAGE) {
			if (is_shmem ||
			    IS_ENABLED(CONFIG_USE_ONLY_THP_FOR_TMPFS))
				return BIT(HPAGE_PMD_ORDER);

			return shmem_mapping_size_order(inode->i_mapping,
							index, write_end);
		}
		fallthrough;
	default:
		return 0;
	}
}
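A note on the return-value encoding, since it is easy to misread:
BIT(order + 1) - 1 sets every order from 0 up to 'order', and the
SHMEM_HUGE_WITHIN_SIZE loop above then walks that mask from the highest
order down. Below is a quick, untested userspace sketch of that walk;
highest_order()/next_order() are re-implemented with GCC builtins to
mirror the helpers in include/linux/huge_mm.h:

#include <stdio.h>

/* Kernel version is fls_long(orders) - 1; assumes orders != 0 */
static unsigned int highest_order(unsigned long orders)
{
	return (8 * sizeof(orders) - 1) - __builtin_clzl(orders);
}

/* Clear the order we just tried and pick the next highest one */
static unsigned int next_order(unsigned long *orders, unsigned int prev)
{
	*orders &= ~(1UL << prev);
	return *orders ? highest_order(*orders) : 0;
}

int main(void)
{
	unsigned long orders = (1UL << (4 + 1)) - 1;	/* orders 0..4 set */
	unsigned int order = highest_order(orders);

	while (orders) {
		printf("trying order %u\n", order);	/* 4, 3, 2, 1, 0 */
		order = next_order(&orders, order);
	}
	return 0;
}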
In summary:

1) Add a new CONFIG_USE_ONLY_THP_FOR_TMPFS Kconfig option to keep
'huge=' mount option compatibility.
2) For tmpfs write(), if CONFIG_USE_ONLY_THP_FOR_TMPFS is not enabled,
we will get the possible huge orders based on the write size.
3) For tmpfs mmap() faults, always use the PMD-sized huge order.
4) For shmem, ignore the write-size logic and always use PMD-sized THP
to check whether global huge is enabled.

However, in case 2), if 'huge=always' is set and the write size is less
than 4K, we will allocate small pages; won't that break the 'huge'
semantics? Maybe it's not something to worry too much about (see the
sketch at the end of this mail).

>>> huge=never: No THPs of any size
>>> huge=always: THPs of any size
>>> huge=fadvise: like "always" but only with fadvise/madvise
>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> The "huge=" default depends on a Kconfig option.
>>>
>>> With that we:
>>>
>>> (1) Maximize the cases where we will use large folios of any size
>>>     (which Willy cares about).
>>> (2) Have a way to disable them completely (which I care about).
>>> (3) Allow distros to keep the default unchanged.
>>>
>>> Likely, for now we will only try allocating PMD-sized THPs during page
>>> faults, and allocate different sizes only during write(). So the effect
>>> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
>>> completely unchanged even with "huge=always".
>>>
>>> It will get more tricky once we change that behavior as well, but
>>> that's likely something to figure out, if it is a real problem, at a
>>> different day :)
>>>
>>> I really preferred using the sysfs toggles (as discussed with Hugh in
>>> the meeting back then), but I can also understand why we at least want
>>> to try making tmpfs behave more like other file systems. But I'm a bit
>>> more careful to not ignore the cases where it really isn't like any
>>> other file system.
>>
>> That was also my previous thought, but Matthew is strongly against
>> that. Let's take it step by step.
>
> Yes, I understand his view as well.
>
> But I won't blindly agree to the "tmpfs is just like any other file
> system" opinion :)
>
>>> If we start making PMD-sized THPs special in any non-configurable way,
>>> then we are effectively *worse* off than allowing them to be configured
>>> properly. So if someone voices "but we want only PMD-sized ones", the
>>> next one will say "but we only want cont-pte sized ones", and then we
>>> should provide an option to control the actual sizes to use
>>> differently, in some way. But let's see if that is even required.
>>
>> Yes, I agree. So what I am thinking is: the 'huge=' option should be
>> gradually deprecated in the future, and eventually tmpfs should
>> allocate large folios of any size by default.
>
> Let's be realistic, it won't get removed any time soon. ;)
>
> So changing the "huge=always" etc. semantics to reflect our new size
> options, and then trying to change the default (with the option for
> people/distros to keep the old default), is a reasonable approach, at
> least to me.
>
> I'm trying to stay open-minded here, but the proposal I heard so far is
> not particularly appealing.
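To make the case 2) question above concrete, here is another quick,
untested userspace sketch of the write-size -> order mapping (modelled
loosely on fgf_set_order(); the fixed 4K PAGE_SHIFT and the floor-log2
rounding are simplifying assumptions, not the exact kernel behaviour).
A sub-4K write ends up with order 0, i.e. a small folio, even under
'huge=always':

#include <stdio.h>

#define PAGE_SHIFT	12	/* assuming 4K base pages */

/* floor(log2(size)) - PAGE_SHIFT, clamped to 0; size must be non-zero */
static unsigned int write_size_to_order(unsigned long size)
{
	unsigned int shift = (8 * sizeof(size) - 1) - __builtin_clzl(size);

	return shift <= PAGE_SHIFT ? 0 : shift - PAGE_SHIFT;
}

int main(void)
{
	printf("2K  write -> order %u\n", write_size_to_order(2048));		/* 0: small folio */
	printf("64K write -> order %u\n", write_size_to_order(65536));		/* 4 */
	printf("2M  write -> order %u\n", write_size_to_order(1UL << 21));	/* 9: PMD order on x86 */
	return 0;
}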