From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 24 Oct 2024 12:52:46 +0200
Subject: Re: [RFC PATCH v3 0/4] Support large folios for tmpfs
From: "Daniel Gomez"
To: "Daniel Gomez", "David Hildenbrand", "Baolin Wang", "Daniel Gomez", "Kirill A. Shutemov"
Cc: "Matthew Wilcox", <21cnbao@gmail.com>, "Kirill A . Shutemov"
X-Mailer: aerc 0.18.2
References: <6dohx7zna7x6hxzo4cwnwarep3a7rohx4qxubds3uujfb7gp3c@2xaubczl2n6d>
 <8e48cf24-83e1-486e-b89c-41edb7eeff3e@linux.alibaba.com>
 <486a72c6-5877-4a95-a587-2a32faa8785d@redhat.com>
 <7eb412d1-f90e-4363-8c7b-072f1124f8a6@linux.alibaba.com>
 <1b0f9f94-06a6-48ac-a68e-848bce1008e9@redhat.com>
Sender: owner-linux-mm@kvack.org

On Thu Oct 24, 2024 at 12:49 PM CEST, Daniel Gomez wrote:
> On Wed Oct 23, 2024 at 11:27 AM CEST, David Hildenbrand wrote:
> > On 23.10.24 10:04, Baolin Wang wrote:
> > >
> > >
> > > On 2024/10/22 23:31, David Hildenbrand wrote:
> > >> On 22.10.24 05:41, Baolin Wang wrote:
> > >>>
> > >>>
> > >>> On 2024/10/21 21:34, Daniel Gomez wrote:
> > >>>> On Mon Oct 21, 2024 at 10:54 AM CEST, Kirill A. Shutemov wrote:
> > >>>>> On Mon, Oct 21, 2024 at 02:24:18PM +0800, Baolin Wang wrote:
> > >>>>>>
> > >>>>>>
> > >>>>>> On 2024/10/17 19:26, Kirill A. Shutemov wrote:
> > >>>>>>> On Thu, Oct 17, 2024 at 05:34:15PM +0800, Baolin Wang wrote:
> > >>>>>>>> + Kirill
> > >>>>>>>>
> > >>>>>>>> On 2024/10/16 22:06, Matthew Wilcox wrote:
> > >>>>>>>>> On Thu, Oct 10, 2024 at 05:58:10PM +0800, Baolin Wang wrote:
> > >>>>>>>>>> Considering that tmpfs already has the 'huge=' option to
> > >>>>>>>>>> control the THP
> > >>>>>>>>>> allocation, it is necessary to maintain compatibility with the
> > >>>>>>>>>> 'huge='
> > >>>>>>>>>> option, as well as considering the 'deny' and 'force' option
> > >>>>>>>>>> controlled
> > >>>>>>>>>> by '/sys/kernel/mm/transparent_hugepage/shmem_enabled'.
> > >>>>>>>>>
> > >>>>>>>>> No, it's not. No other filesystem honours these settings.
> > >>>>>>>>> tmpfs would
> > >>>>>>>>> not have had these settings if it were written today. It should
> > >>>>>>>>> simply
> > >>>>>>>>> ignore them, the way that NFS ignores the "intr" mount option
> > >>>>>>>>> now that
> > >>>>>>>>> we have a better solution to the original problem.
> > >>>>>>>>>
> > >>>>>>>>> To reiterate my position:
> > >>>>>>>>>
> > >>>>>>>>>      - When using tmpfs as a filesystem, it should behave like
> > >>>>>>>>> other
> > >>>>>>>>>        filesystems.
> > >>>>>>>>>      - When using tmpfs to implement MAP_ANONYMOUS | MAP_SHARED,
> > >>>>>>>>> it should
> > >>>>>>>>>        behave like anonymous memory.
> > >>>>>>>>
> > >>>>>>>> I do agree with your point to some extent, but the ‘huge=’ option
> > >>>>>>>> has
> > >>>>>>>> existed for nearly 8 years, and the huge orders based on write
> > >>>>>>>> size may not
> > >>>>>>>> achieve the performance of PMD-sized THP in some scenarios, such
> > >>>>>>>> as when the
> > >>>>>>>> write length is consistently 4K. So, I am still concerned that
> > >>>>>>>> ignoring the
> > >>>>>>>> 'huge' option could lead to compatibility issues.
> > >>>>>>>
> > >>>>>>> Yeah, I don't think we are there yet to ignore the mount option.
> > >>>>>>
> > >>>>>> OK.
> > >>>>>>
> > >>>>>>> Maybe we need to get a new generic interface to request the semantics
> > >>>>>>> tmpfs has with huge= on per-inode level on any fs. Like a set of
> > >>>>>>> FADV_*
> > >>>>>>> handles to make kernel allocate PMD-size folio on any allocation
> > >>>>>>> or on
> > >>>>>>> allocations within i_size. I think this behaviour is useful beyond
> > >>>>>>> tmpfs.
> > >>>>>>>
> > >>>>>>> Then huge= implementation for tmpfs can be re-defined to set these
> > >>>>>>> per-inode FADV_ flags by default. This way we can keep tmpfs
> > >>>>>>> compatible
> > >>>>>>> with current deployments and less special comparing to rest of
> > >>>>>>> filesystems on kernel side.
> > >>>>>>
> > >>>>>> I did a quick search, and I didn't find any other fs that require
> > >>>>>> PMD-sized
> > >>>>>> huge pages, so I am not sure if FADV_* is useful for filesystems
> > >>>>>> other than
> > >>>>>> tmpfs. Please correct me if I missed something.
> > >>>>>
> > >>>>> What do you mean by "require"? THPs are always opportunistic.
> > >>>>>
> > >>>>> IIUC, we don't have a way to hint kernel to use huge pages for a
> > >>>>> file on
> > >>>>> read from backing storage. Readahead is not always the right way.
> > >>>>>
> > >>>>>>> If huge= is not set, tmpfs would behave the same way as the rest of
> > >>>>>>> filesystems.
> > >>>>>>
> > >>>>>> So if 'huge=' is not set, tmpfs write()/fallocate() can still
> > >>>>>> allocate large
> > >>>>>> folios based on the write size? If yes, that means it will change the
> > >>>>>> default huge behavior for tmpfs. Because previously having 'huge='
> > >>>>>> is not
> > >>>>>> set means the huge option is 'SHMEM_HUGE_NEVER', which is similar
> > >>>>>> to what I
> > >>>>>> mentioned:
> > >>>>>> "Another possible choice is to make the huge pages allocation based
> > >>>>>> on write
> > >>>>>> size as the *default* behavior for tmpfs, ..."
> > >>>>>
> > >>>>> I am more worried about breaking existing users of huge pages. So
> > >>>>> changing
> > >>>>> behaviour of users who don't specify huge is okay to me.
> > >>>>
> > >>>> I think moving tmpfs to allocate large folios opportunistically by
> > >>>> default (as it was proposed initially) doesn't necessary conflict with
> > >>>> the default behaviour (huge=never). We just need to clarify that in
> > >>>> the documentation.
> > >>>>
> > >>>> However, and IIRC, one of the requests from Hugh was to have a way to
> > >>>> disable large folios which is something other FS do not have control
> > >>>> of as of today. Ryan sent a proposal to actually control that globally
> > >>>> but I think it didn't move forward. So, what are we missing to go back
> > >>>> to implement large folios in tmpfs in the default case, as any other fs
> > >>>> leveraging large folios?
> > >>>
> > >>> IMHO, as I discussed with Kirill, we still need maintain compatibility
> > >>> with the 'huge=' mount option. This means that if 'huge=never' is set
> > >>> for tmpfs, huge page allocation will still be prohibited (which can
> > >>> address Hugh's request?). However, if 'huge=' is not set, we can
> > >>> allocate large folios based on the write size.
>
> So, in order to make tmpfs behave like other filesystems, we need to
> allocate large folios by default. Not setting 'huge=' is the same as
> setting it to 'huge=never' as per documentation. But 'huge=' is meant to
> control THP, not large folios, so it should not have a conflict here, or
> else, what case are you thinking?
>
> So, to make tmpfs behave like other filesystems, we need to allocate
> large folios by default. According to the documentation, not setting
> 'huge=' is the same as setting 'huge=never.' However, 'huge=' is
> intended to control THP, not large folios, so there shouldn't be
> a conflict in this case. Can you clarify what specific scenario or
> conflict you're considering here? Perhaps when large folios order is the
> same as PMD-size?

Sorry for duplicate paragraph.

>
> > >>
> > >> I consider allocating large folios in shmem/tmpfs on the write path less
> > >> controversial than allocating them on the page fault path -- especially
> > >> as long as we stay within the size to-be-written.
> > >>
> > >> I think in RHEL THP on shmem/tmpfs are disabled as default (e.g.,
> > >> shmem_enabled=never). Maybe because of some rather undesired
> > >> side-effects (maybe some are historical?): I recall issues with VMs with
> > >> THP+ memory ballooning, as we cannot reclaim pages of folios if
> > >> splitting fails). I assume most of these problematic use cases don't use
> > >> tmpfs as an ordinary file system (write()/read()), but mmap() the whole
> > >> thing.
> > >>
> > >> Sadly, I don't find any information about shmem/tmpfs + THP in the RHEL
> > >> documentation; most documentation is only concerned about anon THP.
> > >> Which makes me conclude that they are not suggested as of now.
> > >>
> > >> I see more issues with allocating them on the page fault path and not
> > >> having a way to disable it -- compared to allocating them on the write()
> > >> path.
> > >
> > > I may not understand your issues. IIUC, you can disable allocating huge
> > > pages on the page fault path by using the 'huge=never' mount option or
> > > setting shmem_enabled=deny. No?
> >
> > That's what I am saying: if there is some way to disable it that will
> > keep working, great.
>
> I agree. That aligns with what I recall Hugh requested. However, I
> believe if that is the way to go, we shouldn't limit it to tmpfs.
> Otherwise, why should tmpfs be prevented from allocating large folios if
> other filesystems in the system are allowed to allocate them? I think,
> if we want to disable large folios we should make it more generic,
> something similar to Ryan's proposal [1] for controlling folio sizes.
>
> [1] https://lore.kernel.org/all/20240717071257.4141363-1-ryan.roberts@arm.com/
>
> That said, there has already been disagreement on this point here [2].
>
> [2] https://lore.kernel.org/all/ZvVRiJYfaXD645Nh@casper.infradead.org/
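
For anyone wanting to check which shmem_enabled policy a machine is
actually running with (the 'never' / 'deny' / 'force' values discussed
above), here is a minimal userspace sketch. It is not from the thread or
the patch series. It assumes the usual sysfs convention of marking the
active value with square brackets, and the helper names parse_selected()
and read_shmem_enabled() are made up for illustration.

```python
import re
from pathlib import Path

# sysfs file named in the thread; everything else here is illustrative.
SHMEM_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/shmem_enabled")


def parse_selected(text: str) -> str:
    """Return the bracketed (currently active) value of a sysfs choice list.

    Assumes the "a b [c] d" convention, where brackets mark the selection.
    """
    match = re.search(r"\[([^\]]+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed value in {text!r}")
    return match.group(1)


def read_shmem_enabled(path: Path = SHMEM_ENABLED) -> str:
    """Read the sysfs file and return the active policy name."""
    return parse_selected(path.read_text())


if __name__ == "__main__":
    try:
        print("shmem_enabled:", read_shmem_enabled())
    except (OSError, ValueError) as err:
        # Not every kernel or container exposes this file.
        print("could not determine shmem_enabled:", err)
```

This only reports the global knob. The per-mount 'huge=' option would
have to be read separately, for example from the mount options listed in
/proc/mounts.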