From: David Hildenbrand
Organization: Red Hat
To: Ryan Roberts, Matthew Wilcox
Cc: "Huang, Ying", Andrew Morton, "Kirill A. Shutemov", Yin Fengwei,
 Yu Zhao, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
Date: Fri, 7 Jul 2023 18:06:20 +0200
Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
In-Reply-To: <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com>
References: <20230703135330.1865927-1-ryan.roberts@arm.com>
 <20230703135330.1865927-5-ryan.roberts@arm.com>
 <87edlkgnfa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <44e60630-5e9d-c8df-ab79-cb0767de680e@arm.com>
 <524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com>
 <1e406f04-78ef-6573-e1f1-f0d0e0d5246a@redhat.com>
 <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com>

On 07.07.23 17:13, Ryan Roberts wrote:
> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That means that on a thp=madvise system (which is certainly the case for
>>>>> android and other client systems) we would have to disable large anon
>>>>> folios for VMAs that haven't explicitly opted in. That breaks the intention
>>>>> that this should be an invisible performance boost. I think it's important
>>>>> to set the policy for use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>>
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>>
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>>
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>>
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>>
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>>
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>>
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>>
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
>
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
>
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never". But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"), with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced. I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.

I was briefly playing with a nasty idea of an additional "madvise-pmd" option
(that could be the new default), that would use PMD THP only in madvise'd
regions, and ordinary everywhere else. But let's disregard that for now. I think
there is a bigger issue (below).

>
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?

Thinking about the desired 2 MiB flexible THP on aarch64 (64k kernel) vs. 2 MiB
PMD THP on aarch64 (4k kernel), how are they any different? Just the way they
are mapped ...

It's easy to say "64k vs. 2 MiB" is a difference and we want separate controls,
but how is "2 MiB vs. 2 MiB" different?

Having said that, I think we have to make up our mind how much control we want
to give user space. Again, the "2 MiB vs. 2 MiB" case nicely shows that it's not
trivial: memory waste is a real issue on some systems where we limit THP to
madvise().

Just throwing it out for discussion:

What about keeping the "always / madvise / never" semantics (and MADV_NOHUGEPAGE
...) but having an additional config knob that specifies in which cases we
*still* allow flexible THP even though the system was configured for "madvise"?

I can't come up with a good name for that, but something like
"max_auto_size=64k" could be something reasonable to set. We could have an
arch+hw specific default.

(we all hate config options, I know, but I think there are good reasons to have
such bare-minimum ones)

-- 
Cheers,

David / dhildenb
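
As a rough illustration of the "max_auto_size" idea above, here is a minimal
userspace C sketch of how such a decision could look. Everything in it (the
enum and struct names, allowed_folio_size(), sizes expressed in bytes) is
hypothetical and made up for illustration; it is not existing kernel code. It
only models the always / madvise / never semantics plus MADV_HUGEPAGE and
MADV_NOHUGEPAGE as discussed in this thread, and assumes the knob applies only
in "madvise" mode to VMAs that have not opted in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the global THP mode ("always" / "madvise" / "never"). */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Hypothetical per-VMA hint, mirroring MADV_HUGEPAGE / MADV_NOHUGEPAGE. */
enum vma_hint { VMA_HINT_NONE, VMA_HINT_HUGEPAGE, VMA_HINT_NOHUGEPAGE };

/* Hypothetical policy description; all sizes are in bytes. */
struct thp_policy {
	enum thp_mode mode;
	size_t base_page_size;	/* e.g. 4 KiB or 64 KiB */
	size_t max_folio_size;	/* largest folio arch/hw would use, e.g. 2 MiB */
	size_t max_auto_size;	/* assumed knob: cap without MADV_HUGEPAGE */
};

/* Return the largest folio size allowed for an anonymous fault in a VMA. */
size_t allowed_folio_size(const struct thp_policy *p, enum vma_hint hint)
{
	size_t cap;

	assert(p->base_page_size > 0);
	assert(p->max_folio_size >= p->base_page_size);

	/* An explicit MADV_NOHUGEPAGE or a global "never" always wins. */
	if (hint == VMA_HINT_NOHUGEPAGE || p->mode == THP_NEVER)
		return p->base_page_size;

	/* "always", or "madvise" with an explicit opt-in: no extra cap. */
	if (p->mode == THP_ALWAYS || hint == VMA_HINT_HUGEPAGE)
		return p->max_folio_size;

	/* "madvise" without opt-in: allow folios only up to max_auto_size. */
	cap = p->max_auto_size < p->max_folio_size ?
	      p->max_auto_size : p->max_folio_size;

	/* Never go below a single base page. */
	return cap < p->base_page_size ? p->base_page_size : cap;
}
```

With 4 KiB base pages, a 2 MiB maximum and max_auto_size set to 64 KiB, a
"madvise" system would in this sketch get 64 KiB for a VMA without any hint,
2 MiB for a VMA with MADV_HUGEPAGE, and a single 4 KiB page for a VMA with
MADV_NOHUGEPAGE.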