From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 24 Oct 2024 12:49:09 +0200
From: "Daniel Gomez" <d@kruces.com>
To: "David Hildenbrand", "Baolin Wang", "Daniel Gomez", "Kirill A. Shutemov"
Cc: "Matthew Wilcox", <21cnbao@gmail.com>, "Kirill A. Shutemov"
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
X-Mailer: aerc 0.18.2
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
References: <6dohx7zna7x6hxzo4cwnwarep3a7rohx4qxubds3uujfb7gp3c@2xaubczl2n6d>
 <8e48cf24-83e1-486e-b89c-41edb7eeff3e@linux.alibaba.com>
 <486a72c6-5877-4a95-a587-2a32faa8785d@redhat.com>
 <7eb412d1-f90e-4363-8c7b-072f1124f8a6@linux.alibaba.com>
 <1b0f9f94-06a6-48ac-a68e-848bce1008e9@redhat.com>
In-Reply-To: <1b0f9f94-06a6-48ac-a68e-848bce1008e9@redhat.com>
List-ID: <linux-mm.kvack.org>
On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
> On 23.10.24 10:04, Baolin Wang wrote:
> >
> > On 2024/10/22 23:31, David Hildenbrand wrote:
> >> On 22.10.24 05:41, Baolin Wang wrote:
> >>>
> >>> On 2024/10/21 21:34, Daniel Gomez wrote:
> >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> >>>>>>
> >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> >>>>>>>> + Kirill
> >>>>>>>>
> >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
> >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to control
> >>>>>>>>>> the THP allocation, it is necessary to maintain compatibility
> >>>>>>>>>> with the 'huge=' option, as well as considering the 'deny' and
> >>>>>>>>>> 'force' options controlled by
> >>>>>>>>>> '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> >>>>>>>>>
> >>>>>>>>> No, it's not. No other filesystem honours these settings.
> >>>>>>>>> tmpfs would not have had these settings if it were written today.
> >>>>>>>>> It should simply ignore them, the way that NFS ignores the "intr"
> >>>>>>>>> mount option now that we have a better solution to the original
> >>>>>>>>> problem.
> >>>>>>>>>
> >>>>>>>>> To reiterate my position:
> >>>>>>>>>
> >>>>>>>>>   - When using tmpfs as a filesystem, it should behave like other
> >>>>>>>>>     filesystems.
> >>>>>>>>>   - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED, it
> >>>>>>>>>     should behave like anonymous memory.
> >>>>>>>>
> >>>>>>>> I do agree with your point to some extent, but the 'huge=' option
> >>>>>>>> has existed for nearly 8 years, and the huge orders based on write
> >>>>>>>> size may not achieve the performance of PMD-sized THP in some
> >>>>>>>> scenarios, such as when the write length is consistently 4K. So, I
> >>>>>>>> am still concerned that ignoring the 'huge=' option could lead to
> >>>>>>>> compatibility issues.
> >>>>>>>
> >>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
> >>>>>>
> >>>>>> OK.
> >>>>>>
> >>>>>>> Maybe we need a new generic interface to request the semantics tmpfs
> >>>>>>> has with huge= at a per-inode level on any fs. Like a set of FADV_*
> >>>>>>> handles to make the kernel allocate PMD-size folios on any
> >>>>>>> allocation, or on allocations within i_size. I think this behaviour
> >>>>>>> is useful beyond tmpfs.
> >>>>>>>
> >>>>>>> Then the huge= implementation for tmpfs can be re-defined to set
> >>>>>>> these per-inode FADV_ flags by default. This way we can keep tmpfs
> >>>>>>> compatible with current deployments and less special compared to the
> >>>>>>> rest of the filesystems on the kernel side.
> >>>>>>
> >>>>>> I did a quick search, and I didn't find any other fs that requires
> >>>>>> PMD-sized huge pages, so I am not sure if FADV_* is useful for
> >>>>>> filesystems other than tmpfs. Please correct me if I missed something.
> >>>>>
> >>>>> What do you mean by "require"? THPs are always opportunistic.
> >>>>>
> >>>>> IIUC, we don't have a way to hint the kernel to use huge pages for a
> >>>>> file on read from backing storage. Readahead is not always the right
> >>>>> way.
> >>>>>
> >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
> >>>>>>> the filesystems.
> >>>>>>
> >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still allocate
> >>>>>> large folios based on the write size? If yes, that means it will
> >>>>>> change the default huge behavior for tmpfs, because previously not
> >>>>>> setting 'huge=' meant the huge option was 'SHMEM_HUGE_NEVER', which is
> >>>>>> similar to what I mentioned:
> >>>>>> "Another possible choice is to make the huge pages allocation based
> >>>>>> on write size as the *default* behavior for tmpfs, ..."
> >>>>>
> >>>>> I am more worried about breaking existing users of huge pages. So
> >>>>> changing the behaviour of users who don't specify huge= is okay to me.
> >>>>
> >>>> I think moving tmpfs to allocate large folios opportunistically by
> >>>> default (as it was proposed initially) doesn't necessarily conflict
> >>>> with the default behaviour (huge=never). We just need to clarify that
> >>>> in the documentation.
> >>>>
> >>>> However, and IIRC, one of the requests from Hugh was to have a way to
> >>>> disable large folios, which is something other filesystems have no
> >>>> control over as of today. Ryan sent a proposal to actually control that
> >>>> globally, but I think it didn't move forward.
> >>>> So, what are we missing to go back to implementing large folios in
> >>>> tmpfs in the default case, as any other fs leveraging large folios?
> >>>
> >>> IMHO, as I discussed with Kirill, we still need to maintain
> >>> compatibility with the 'huge=' mount option. This means that if
> >>> 'huge=never' is set for tmpfs, huge page allocation will still be
> >>> prohibited (which can address Hugh's request?). However, if 'huge=' is
> >>> not set, we can allocate large folios based on the write size.

So, to make tmpfs behave like other filesystems, we need to allocate large
folios by default. According to the documentation, not setting 'huge=' is
the same as setting 'huge=never'. However, 'huge=' is intended to control
THP, not large folios, so there shouldn't be a conflict in this case. Can
you clarify what specific scenario or conflict you're considering here?
Perhaps when the large folio order is the same as PMD-size?

> >>
> >> I consider allocating large folios in shmem/tmpfs on the write path less
> >> controversial than allocating them on the page fault path -- especially
> >> as long as we stay within the size to-be-written.
> >>
> >> I think in RHEL, THP on shmem/tmpfs is disabled by default (e.g.,
> >> shmem_enabled=never), maybe because of some rather undesired
> >> side-effects (maybe some are historical?): I recall issues with VMs
> >> with THP + memory ballooning, as we cannot reclaim pages of folios if
> >> splitting fails. I assume most of these problematic use cases don't use
> >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> >> thing.
> >>
> >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> >> documentation; most documentation is only concerned with anon THP.
> >> Which makes me conclude that they are not suggested as of now.
> >>
> >> I see more issues with allocating them on the page fault path and not
> >> having a way to disable it -- compared to allocating them on the
> >> write() path.
> >
> > I may not understand your issues. IIUC, you can disable allocating huge
> > pages on the page fault path by using the 'huge=never' mount option or
> > setting shmem_enabled=deny. No?
>
> That's what I am saying: if there is some way to disable it that will
> keep working, great.

I agree. That aligns with what I recall Hugh requested. However, I believe
that if that is the way to go, we shouldn't limit it to tmpfs. Otherwise,
why should tmpfs be prevented from allocating large folios if other
filesystems in the system are allowed to allocate them?

I think, if we want to disable large folios, we should make it more
generic, something similar to Ryan's proposal [1] for controlling folio
sizes.

[1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/

That said, there has already been disagreement on this point here [2].

[2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
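For reference, the existing knobs discussed in this thread (per the kernel's transhuge admin guide) can be exercised as follows; this requires root, and /mnt/test is a placeholder mount point:

```shell
# Per-mount THP policy for tmpfs; valid values are
# huge=never|always|within_size|advise
mount -t tmpfs -o huge=within_size tmpfs /mnt/test

# Change the policy on a live mount without losing data:
mount -o remount,huge=never /mnt/test

# System-wide shmem/tmpfs setting; the extra values 'deny' and
# 'force' override the per-mount huge= option:
cat /sys/kernel/mm/transparent_hugepage/shmem_enabled
```

Any future "disable large folios" control would presumably sit next to these, which is why keeping huge=never meaningful matters for compatibility.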