From: Baolin Wang
Date: Thu, 31 Oct 2024 18:04:51 +0800
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
To: David Hildenbrand, Daniel Gomez, Daniel Gomez, "Kirill A. Shutemov"
Cc: Matthew Wilcox, akpm@linux-foundation.org, hughd@google.com, wangkefeng.wang@huawei.com, 21cnbao@gmail.com, ryan.roberts@arm.com, ioworker0@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A . Shutemov"
Message-ID: <2782890e-09dc-46bd-ab86-1f8974c7eb7a@linux.alibaba.com>

On 2024/10/31 16:53, David Hildenbrand wrote:
>>>>
>>>> If we don't want to go with the shmem_enabled toggles, we should
>>>> probably still extend the documentation to cover "all THP sizes", like
>>>> we did elsewhere.
>>>>
>>>> huge=never: no THPs of any size
>>>> huge=always: THPs of any size (fault/write/etc)
>>>> huge=fadvise: like "always" but only with fadvise/madvise
>>>> huge=within_size: like "fadvise" but respect i_size
>>>
>>> Thinking some more about that over the weekend, this is likely the way
>>> to go, paired with conditionally changing the default to
>>> always/within_size. I suggest a kconfig option for that.
>>
>> I am still worried about adding a new kconfig option, which might
>> complicate the tmpfs controls further.
>
> Why exactly?

There would be more options to control huge page allocation for tmpfs,
which may confuse users and make life harder. Yes, we can add some
documentation, but I'm still a bit cautious about this.

> If we are changing a default similar to
> CONFIG_TRANSPARENT_HUGEPAGE_NEVER -> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS,
> it would make perfectly sense to give people building a kernel control
> over that.
>
> If we want to support this feature in a distro kernel like RHEL we'll
> have to leave the default unmodified. Otherwise I see no way (excluding
> downstream-only hacks) to backport this into distro kernels.
>
>>
>>> That should probably do as a first shot; I assume people will want more
>>> control over which size to use, especially during page faults, but that
>>> can likely be added later.
>
> I know, it puts you in a bad position because there are different
> opinions floating around. But let's try to find something that is
> reasonable and still acceptable.
> And let's hope that Hugh will voice an opinion :D

Yes, I am also waiting to see if Hugh has any input :)

>> After some discussions, I think the first step is to achieve two goals:
>> 1) Try to make tmpfs use large folios like other file systems, that
>> means we should avoid adding more complex control options (per Matthew).
>> 2) Still need maintain compatibility with the 'huge=' mount option (per
>> Kirill), as I also remembered we have customers who use
>> 'huge=within_size' to allocate THPs for better performance.
>>
>> Based on these considerations, my first step is to neither add a new
>> 'huge=' option parameter nor introduce the mTHP interfaces control for
>> tmpfs, but rather to change the default huge allocation behavior for
>> tmpfs. That is to say, when 'huge=' option is not configured, we will
>> allow the huge folios allocation based on the write size. As a result,
>> the behavior of huge pages for tmpfs will change as follows:
>>
>> no 'huge=' set: can allocate any size huge folios based on write size
>> huge=never: no any size huge folios
>> huge=always: only PMD sized THP allocation as before
>> huge=fadvise: like "always" but only with fadvise/madvise
>> huge=within_size: like "fadvise" but respect i_size
>
> I don't like that:
>
> (a) there is no way to explicitly enable/name that new behavior.

But this is similar to other file systems that enable large folios (by
calling mapping_set_large_folios()), and I haven't seen any other file
system that supports large folios require a new Kconfig option. Maybe
tmpfs is a bit special?

If we all agree that tmpfs is a bit special when using huge pages, then
fine, a Kconfig option might be needed.

> (b) "always" etc. are only concerned about PMDs.

Yes, for now this keeps the same semantics as before, in case users
still expect THPs.
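To make the "based on the write size" part of my first step a bit more concrete, here is a rough userspace sketch in Python. It is only a toy model, not the patch code. PAGE_SHIFT, PMD_ORDER, order_for_write() and first_step_order() are names made up for this mail, it assumes 4 KiB base pages and a 2 MiB PMD, and it leaves out the fadvise and within_size rows because they depend on hints and i_size that the model does not track.

```python
# Toy model of the "first step" table above; all names here are hypothetical.
PAGE_SHIFT = 12   # assumption: 4 KiB base pages
PMD_ORDER = 9     # assumption: 2 MiB PMD-sized THP


def order_for_write(write_len):
    """Largest folio order whose size fits in write_len, capped at PMD_ORDER."""
    pages = write_len >> PAGE_SHIFT
    if pages <= 1:
        return 0
    # bit_length() - 1 is floor(log2(pages)), so (1 << order) <= pages.
    return min(pages.bit_length() - 1, PMD_ORDER)


def first_step_order(huge, write_len):
    """Highest order tried for a write; huge is None when 'huge=' is not set."""
    if huge is None:
        return order_for_write(write_len)   # any size, driven by write size
    if huge == "never":
        return 0                            # no huge folios of any size
    if huge == "always":
        return PMD_ORDER                    # only PMD-sized THP, as before
    raise ValueError("mode not covered by this sketch: %r" % (huge,))
```

For example, first_step_order(None, 64 * 1024) picks order 4 (a 64 KiB folio), while first_step_order("always", 64 * 1024) still asks for the PMD order. Fallback to smaller orders on allocation failure is not modelled.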
> So again, I suggest:
>
> huge=never: No THPs of any size
> huge=always: THPs of any size
> huge=fadvise: like "always" but only with fadvise/madvise
> huge=within_size: like "fadvise" but respect i_size
>
> "huge=" default depends on a Kconfig option.
>
> With that we:
>
> (1) Maximize the cases where we will use large folios of any sizes
>     (which Willy cares about).
> (2) Have a way to disable them completely (which I care about).
> (3) Allow distros to keep the default unchanged.
>
> Likely, for now we will only try allocating PMD-sized THPs during page
> faults, and allocate different sizes only during write(). So the effect
> for many use cases (VMs, DBs) that primarily mmap() tmpfs files will be
> completely unchanged even with "huge=always".
>
> It will get more tricky once we change that behavior as well, but that's
> something to likely figure out if it is a real problem at at different
> day :)
>
>
> I really preferred using the sysfs toggles (as discussed with Hugh in
> the meeting back then), but I can also understand why we at least want
> to try making tmpfs behave more like other file systems. But I'm a bit
> more careful to not ignore the cases where it really isn't like any
> other file system.

That was also my earlier thinking, but Matthew is strongly against it.
Let's go step by step.

> If we start making PMD-sized THPs special in any non-configurable way,
> then we are effectively off *worse* than allowing to configure them
> properly. So if someone voices "but we want only PMD-sized" ones, the
> next one will say "but we only want cont-pte sized-ones" and then we
> should provide an option to control the actual sizes to use differently,
> in some way. But let's see if that is even required.

Yes, I agree. So what I am thinking is that the 'huge=' option should be
gradually deprecated in the future, so that eventually tmpfs can allocate
large folios of any size by default.
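For comparison, the suggestion quoted above (a Kconfig-dependent default, PMD-sized only at page fault, any size at write()) could be modelled roughly like this in Python. Again this is just a sketch to check that I understand the proposal: kconfig_default stands in for a Kconfig option that does not exist yet, effective_huge() and candidate_orders() are hypothetical names, PMD_ORDER = 9 is an assumption, and only the never/always rows are covered.

```python
# Toy model of the suggested semantics; all names here are hypothetical.
PMD_ORDER = 9  # assumption: 2 MiB PMD with 4 KiB base pages


def effective_huge(mount_opt, kconfig_default="never"):
    """An explicit 'huge=' wins; otherwise use the build-time default."""
    return kconfig_default if mount_opt is None else mount_opt


def candidate_orders(mount_opt, path, kconfig_default="never"):
    """Large-folio orders to try, highest first, for a 'fault' or 'write' path."""
    mode = effective_huge(mount_opt, kconfig_default)
    if mode == "never":
        return []                              # no THPs of any size
    if mode != "always":
        raise ValueError("mode not covered by this sketch: %r" % (mode,))
    if path == "fault":
        return [PMD_ORDER]                     # faults only try PMD-sized
    if path == "write":
        return list(range(PMD_ORDER, 0, -1))   # write() may use any size
    raise ValueError("unknown path: %r" % (path,))
```

With kconfig_default="never" a distro build keeps today's behavior when 'huge=' is not given, and an mmap()-heavy workload sees only [PMD_ORDER] at fault time even with huge=always, which matches the "completely unchanged" point above.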