From: "Huang, Ying"
To: Ryan Roberts
Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton,
 "Kirill A. Shutemov", Yin Fengwei, "Yu Zhao", Catalin Marinas,
 "Will Deacon", Anshuman Khandual, Yang Shi, Johannes Weiner,
 Alexander Zhu
Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
Date: Mon, 10 Jul 2023 11:03:01 +0800
Message-ID: <87y1jofoyi.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com> (Ryan Roberts's
 message of "Fri, 7 Jul 2023 16:13:52 +0100")
References: <20230703135330.1865927-1-ryan.roberts@arm.com>
 <20230703135330.1865927-5-ryan.roberts@arm.com>
 <87edlkgnfa.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <44e60630-5e9d-c8df-ab79-cb0767de680e@arm.com>
 <524bacd2-4a47-2b8b-6685-c46e31a01631@redhat.com>
 <1e406f04-78ef-6573-e1f1-f0d0e0d5246a@redhat.com>
 <9dd036a8-9ba3-0cc4-b791-cb3178237728@arm.com>

Ryan Roberts writes:

> On 07/07/2023 15:07, David Hildenbrand wrote:
>> On 07.07.23 15:57, Matthew Wilcox wrote:
>>> On Fri, Jul 07, 2023 at 01:29:02PM +0200, David Hildenbrand wrote:
>>>> On 07.07.23 11:52, Ryan Roberts wrote:
>>>>> On 07/07/2023 09:01, Huang, Ying wrote:
>>>>>> Although we can use smaller page order for FLEXIBLE_THP, it's hard to
>>>>>> avoid internal fragmentation completely.  So, I think that finally we
>>>>>> will need to provide a
>>>>>> mechanism for the users to opt out, e.g.,
>>>>>> something like "always madvise never" via
>>>>>> /sys/kernel/mm/transparent_hugepage/enabled.  I'm not sure whether it's
>>>>>> a good idea to reuse the existing interface of THP.
>>>>>
>>>>> I wouldn't want to tie this to the existing interface, simply because that
>>>>> implies that we would want to follow the "always" and "madvise" advice too;
>>>>> That means that on a thp=madvise system (which is certainly the case for
>>>>> android and other client systems) we would have to disable large anon
>>>>> folios for VMAs that haven't explicitly opted in. That breaks the intention
>>>>> that this should be an invisible performance boost. I think it's important
>>>>> to set the policy for use of
>>>>
>>>> It will never ever be a completely invisible performance boost, just like
>>>> ordinary THP.
>>>>
>>>> Using the exact same existing toggle is the right thing to do. If someone
>>>> specify "never" or "madvise", then do exactly that.
>>>>
>>>> It might make sense to have more modes or additional toggles, but
>>>> "madvise=never" means no memory waste.
>>>
>>> I hate the existing mechanisms.  They are an abdication of our
>>> responsibility, and an attempt to blame the user (be it the sysadmin
>>> or the programmer) of our code for using it wrongly.  We should not
>>> replicate this mistake.
>>
>> I don't agree regarding the programmer responsibility. In some cases the
>> programmer really doesn't want to get more memory populated than requested --
>> and knows exactly why setting MADV_NOHUGEPAGE is the right thing to do.
>>
>> Regarding the madvise=never/madvise/always (sys admin decision), memory waste
>> (and nailing down bugs or working around them in customer setups) have been very
>> good reasons to let the admin have a word.
>>
>>>
>>> Our code should be auto-tuning.  I posted a long, detailed outline here:
>>> https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/
>>>
>>
>> Well, "auto-tuning" also should be perfect for everybody, but once reality
>> strikes you know it isn't.
>>
>> If people don't feel like using THP, let them have a word. The "madvise" config
>> option is probably more controversial. But the "always vs. never" absolutely
>> makes sense to me.
>>
>>>> I remember I raised it already in the past, but you *absolutely* have to
>>>> respect the MADV_NOHUGEPAGE flag. There is user space out there (for
>>>> example, userfaultfd) that doesn't want the kernel to populate any
>>>> additional page tables. So if you have to respect that already, then also
>>>> respect MADV_HUGEPAGE, simple.
>>>
>>> Possibly having uffd enabled on a VMA should disable using large folios,
>>
>> There are cases where we enable uffd *after* already touching memory (postcopy
>> live migration in QEMU being the famous example). That doesn't fly.
>>
>>> I can get behind that.  But the notion that userspace knows what it's
>>> doing ... hahaha.  Just ignore the madvise flags.  Userspace doesn't
>>> know what it's doing.
>>
>> If user space sets MADV_NOHUGEPAGE, it exactly knows what it is doing ... in
>> some cases. And these include cases I care about messing with sparse VM memory :)
>>
>> I have strong opinions against populating more than required when user space set
>> MADV_NOHUGEPAGE.
>
> I can see your point about honouring MADV_NOHUGEPAGE, so think that it is
> reasonable to fallback to allocating an order-0 page in a VMA that has it set.
> The app has gone out of its way to explicitly set it, after all.
>
> I think the correct behaviour for the global thp controls (cmdline and sysfs)
> are less obvious though. I could get on board with disabling large anon folios
> globally when thp="never".
> But for other situations, I would prefer to keep
> large anon folios enabled (treat "madvise" as "always"),

If we have some mechanism to auto-tune large folio usage, for example,
by detecting internal fragmentation and splitting the large folio, then
we can use thp="always" as the default configuration.  If I remember
correctly, this is what Johannes and Alexander are working on.

> with the argument that
> their order is much smaller than traditional THP and therefore the internal
> fragmentation is significantly reduced.

Do you have any data for this?

> I really don't want to end up with user
> space ever having to opt-in (with MADV_HUGEPAGE) to see the benefits of large
> anon folios.
>
> I still feel that it would be better for the thp and large anon folio controls
> to be independent though - what's the argument for tying them together?
>

Best Regards,
Huang, Ying
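
For readers following the thread, the policy options debated above can be
written down as a small model. This is only an illustrative sketch in Python,
not kernel code and not taken from the patch series. The function names and the
madvise_as_always switch are hypothetical. The one detail taken from the real
interface is that /sys/kernel/mm/transparent_hugepage/enabled marks the active
mode with square brackets, e.g. "always [madvise] never".

```python
import re

THP_MODES = ("always", "madvise", "never")


def parse_thp_enabled(text):
    """Return the active mode from a string such as 'always [madvise] never'.

    The active setting is the one wrapped in square brackets.
    """
    match = re.search(r"\[(\w+)\]", text)
    if match is None or match.group(1) not in THP_MODES:
        raise ValueError(f"unrecognised THP mode string: {text!r}")
    return match.group(1)


def large_anon_folio_allowed(mode, nohugepage=False, hugepage=False,
                             madvise_as_always=False):
    """Hypothetical model of the policies discussed in this thread.

    mode              -- global THP mode: 'always', 'madvise' or 'never'
    nohugepage        -- the VMA was marked with MADV_NOHUGEPAGE
    hugepage          -- the VMA was marked with MADV_HUGEPAGE
    madvise_as_always -- False: reuse the existing THP toggle unchanged;
                         True: treat 'madvise' like 'always'
    """
    if mode not in THP_MODES:
        raise ValueError(f"unknown THP mode: {mode!r}")
    # An explicit MADV_NOHUGEPAGE always falls back to order-0 pages.
    if nohugepage:
        return False
    # "never" disables large anon folios globally.
    if mode == "never":
        return False
    if mode == "always":
        return True
    # mode == "madvise": this is the case where the two proposals differ.
    return madvise_as_always or hugepage


if __name__ == "__main__":
    mode = parse_thp_enabled("always [madvise] never")
    print(mode,
          large_anon_folio_allowed(mode),
          large_anon_folio_allowed(mode, madvise_as_always=True))
```

With madvise_as_always=False the model follows the "reuse the existing toggle"
position. With madvise_as_always=True it follows the proposal to keep large anon
folios enabled on thp=madvise systems unless a VMA sets MADV_NOHUGEPAGE. It does
not model any auto-tuning or folio splitting.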