From: "Huang, Ying" <ying.huang@intel.com>
To: Ryan Roberts
Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, "Kirill A. Shutemov",
 Yin Fengwei, Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual,
 Yang Shi, Johannes Weiner, Alexander Zhu
Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
References: <20230703135330.1865927-1-ryan.roberts@arm.com>
 <20230703135330.1865927-5-ryan.roberts@arm.com>
 <87edlkgnfa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <44e60630-5e9d-c8df-ab79-cb0767de680e@arm.com>
 <524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com>
 <1e406f04-78ef-6573-e1f1-f0d0e0d5246a@redhat.com>
 <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com>
 <87y1jofoyi.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <6c2f3127-9334-85ba-48f6-83a9c87abde0@arm.com>
 <874jmcf7kh.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <501d1a6a-c9cf-94c3-b773-c648b944b30f@arm.com>
Date: Tue, 11 Jul 2023 08:48:16 +0800
In-Reply-To: <501d1a6a-c9cf-94c3-b773-c648b944b30f@arm.com> (Ryan Roberts's
 message of "Mon, 10 Jul 2023 10:25:36 +0100")
Message-ID: <87v8ere0j3.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Ryan Roberts writes:

> On 10/07/2023 10:18, Huang, Ying wrote:
>> Ryan Roberts writes:
>>
>>> On 10/07/2023 04:03, Huang, Ying wrote:
>>>> Ryan Roberts writes:
>>>>
>>>>> On 07/07/2023 15:07, David Hildenbrand wrote:
>>>>>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>>>>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>>>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>>>>>> something like "always madvise never" via
>>>>>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>>>>>> a good idea to reuse the existing interface of THP.
>>>>>>>>>
>>>>>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>>>>>> that means that on a thp=madvise system (which is certainly the case for
>>>>>>>>> android and other client systems) we would have to disable large anon
>>>>>>>>> folios for VMAs that haven't explicitly opted in. That breaks the intention
>>>>>>>>> that this should be an invisible performance boost. I think it's important
>>>>>>>>> to set the policy for use of
>>>>>>>>
>>>>>>>> It will never ever be a completely invisible performance boost, just like
>>>>>>>> ordinary THP.
>>>>>>>>
>>>>>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>>>>>> specifies "never" or "madvise", then do exactly that.
>>>>>>>>
>>>>>>>> It might make sense to have more modes or additional toggles, but
>>>>>>>> "madvise=never" means no memory waste.
>>>>>>>
>>>>>>> I hate the existing mechanisms.  They are an abdication of our
>>>>>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>>>>>> or the programmer) of our code for using it wrongly.  We should not
>>>>>>> replicate this mistake.
>>>>>>
>>>>>> I don't agree regarding the programmer responsibility. In some cases the
>>>>>> programmer really doesn't want to get more memory populated than requested --
>>>>>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>>>>>
>>>>>> Regarding the madvise=never/madvise/always (sysadmin decision), memory waste
>>>>>> (and nailing down bugs or working around them in customer setups) have been
>>>>>> very good reasons to let the admin have a word.
>>>>>>
>>>>>>>
>>>>>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>>>>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>>>>>
>>>>>>
>>>>>> Well, "auto-tuning" should also be perfect for everybody, but once reality
>>>>>> strikes you know it isn't.
>>>>>>
>>>>>> If people don't feel like using THP, let them have a word. The "madvise"
>>>>>> config option is probably more controversial. But the "always vs. never"
>>>>>> absolutely makes sense to me.
>>>>>>
>>>>>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>>>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>>>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>>>>>> additional page tables. So if you have to respect that already, then also
>>>>>>>> respect MADV_HUGEPAGE, simple.
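
(For reference, a minimal user-space sketch of the opt-out being discussed
here. madvise(2) and MADV_NOHUGEPAGE are the real interfaces; the mapping
and its size below are made up purely for illustration.)

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
        size_t len = 16UL * 1024 * 1024;        /* arbitrary 16 MiB example */

        /* Reserve an anonymous mapping. */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* Opt this range out of THP/large-folio population. */
        if (madvise(buf, len, MADV_NOHUGEPAGE))
                return 1;

        /* ... e.g. register the range with userfaultfd and touch it later ... */
        return 0;
}

A userfaultfd user would typically issue the madvise() before registering
the range, so that the kernel never populates PTEs behind its back.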
>>>>>>>
>>>>>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>>>>>
>>>>>> There are cases where we enable uffd *after* already touching memory (postcopy
>>>>>> live migration in QEMU being the famous example). That doesn't fly.
>>>>>>
>>>>>>> I can get behind that.  But the notion that userspace knows what it's
>>>>>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>>>>>> know what it's doing.
>>>>>>
>>>>>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>>>>>> some cases. And these include cases I care about messing with sparse VM memory :)
>>>>>>
>>>>>> I have strong opinions against populating more than required when user space
>>>>>> sets MADV_NOHUGEPAGE.
>>>>>
>>>>> I can see your point about honouring MADV_NOHUGEPAGE, so I think that it is
>>>>> reasonable to fall back to allocating an order-0 page in a VMA that has it set.
>>>>> The app has gone out of its way to explicitly set it, after all.
>>>>>
>>>>> I think the correct behaviour for the global thp controls (cmdline and sysfs)
>>>>> is less obvious though. I could get on board with disabling large anon folios
>>>>> globally when thp="never". But for other situations, I would prefer to keep
>>>>> large anon folios enabled (treat "madvise" as "always"),
>>>>
>>>> If we have some mechanism to auto-tune the large folios usage, for
>>>> example, detect the internal fragmentation and split the large folio,
>>>> then we can use thp="always" as the default configuration. If I remember
>>>> correctly, this is what Johannes and Alexander are working on.
>>>
>>> Could you point me to that work? I'd like to understand what the mechanism is.
>>> The other half of my work aims to use arm64's pte "contiguous bit" to tell the
>>> HW that a span of PTEs shares the same mapping and is therefore coalesced into
>>> a single TLB entry. The side effect of this, however, is that we only have a
>>> single access and dirty bit for the whole contpte extent. So I'd like to avoid
>>> any mechanism that relies on getting access/dirty at the base page granularity
>>> for a large folio.
>>
>> Please take a look at the THP shrinker patchset,
>>
>> https://lore.kernel.org/linux-mm/cover.1667454613.git.alexlzhu@fb.com/
>
> Thanks!
>
>>
>>>>
>>>>> with the argument that
>>>>> their order is much smaller than traditional THP and therefore the internal
>>>>> fragmentation is significantly reduced.
>>>>
>>>> Do you have any data for this?
>>>
>>> Some; it's partly based on intuition that the smaller the allocation unit, the
>>> smaller the internal fragmentation. And partly on peak memory usage data I've
>>> collected for the benchmarks I'm running, comparing a baseline-4k kernel with
>>> baseline-16k and baseline-64k kernels, along with a 4k kernel that supports
>>> large anon folios (I appreciate that's not exactly what we are talking about
>>> here, and it's not exactly an extensive set of results!):
>>>
>>>
>>> Kernel Compilation with 8 Jobs:
>>> | kernel        |  peak |
>>> |:--------------|------:|
>>> | baseline-4k   |  0.0% |
>>> | anonfolio     |  0.1% |
>>> | baseline-16k  |  6.3% |
>>> | baseline-64k  | 28.1% |
>>>
>>>
>>> Kernel Compilation with 80 Jobs:
>>> | kernel        |  peak |
>>> |:--------------|------:|
>>> | baseline-4k   |  0.0% |
>>> | anonfolio     |  1.7% |
>>> | baseline-16k  |  2.6% |
>>> | baseline-64k  | 12.3% |
>>>
>>
>> Why is anonfolio better than baseline-64k if you always allocate 64k
>> anonymous folios?  Because page cache uses 64k in baseline-64k?
>
> No, because the VMA boundaries are aligned to 4K and not 64K. Large Anon Folios
> only allocates a 64K folio if it does not breach the bounds of the VMA (and if
> it doesn't overlap other allocated PTEs).
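
(To make that rule concrete, a rough, self-contained pseudo-C sketch of the
check; the function name and parameters are invented for illustration only,
the real patch operates on VMAs and page tables rather than plain arrays.)

#include <stdbool.h>

#define PAGE_SIZE       4096UL
#define FOLIO_PAGES     16                      /* 64K / 4K */

/* pte_populated[] stands in for the real page-table walk. */
static bool use_64k_folio(unsigned long fault_addr,
                          unsigned long vma_start, unsigned long vma_end,
                          const bool pte_populated[FOLIO_PAGES])
{
        unsigned long block = fault_addr & ~(FOLIO_PAGES * PAGE_SIZE - 1);

        /* The naturally aligned 64K block must lie wholly inside the VMA... */
        if (block < vma_start || block + FOLIO_PAGES * PAGE_SIZE > vma_end)
                return false;

        /* ...and must not overlap PTEs that are already populated. */
        for (int i = 0; i < FOLIO_PAGES; i++)
                if (pte_populated[i])
                        return false;

        return true;    /* safe to allocate one 64K anonymous folio */
}

(If either condition fails, the fault presumably falls back to a smaller
allocation, ultimately a single order-0 4K page.)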
Thanks for the explanation!  baseline-64k will use more memory for the file
cache too, right?  So you observed much higher memory usage for anonymous
pages, but not for file cache pages?

>>
>> We may need to test some workloads with sparse access patterns too.
>
> Yes, I agree if you have a workload with a pathological memory access pattern
> where it writes to addresses with a stride of 64K, all contained in a single
> VMA, then you will end up allocating 16x the memory. This is obviously an
> unrealistic extreme though.

I think there should be some realistic workloads with sparse access patterns.

Best Regards,
Huang, Ying

>>
>>>>
>>>>> I really don't want to end up with user
>>>>> space ever having to opt in (with MADV_HUGEPAGE) to see the benefits of large
>>>>> anon folios.
>>>>>
>>>>> I still feel that it would be better for the thp and large anon folio controls
>>>>> to be independent though - what's the argument for tying them together?
>>>>>
>>
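
(As an aside on the 16x figure above: one 64K folio per touched 4K page works
out to 64K / 4K = 16 times the resident memory. Below is a tiny, purely
illustrative C program for that pathological stride pattern; the 256 MiB size
is arbitrary, and it assumes every touch really is backed by a full 64K folio.)

#define _GNU_SOURCE
#include <sys/mman.h>

int main(void)
{
        size_t len = 256UL * 1024 * 1024;       /* arbitrary 256 MiB VMA */

        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return 1;

        /* Write with a 64K stride: one 4K page touched per 64K block. */
        for (size_t off = 0; off < len; off += 64 * 1024)
                buf[off] = 1;

        return 0;
}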