From: "Huang, Ying"
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
    Yu Zhao, Catalin Marinas, Anshuman Khandual, Yang Shi, Zi Yan,
    Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov"
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance
References: <20230810142942.3169679-1-ryan.roberts@arm.com>
    <20230810142942.3169679-4-ryan.roberts@arm.com>
    <87v8dg6lfu.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <5c9ba378-2920-4892-bdf0-174e47d528b7@arm.com>
Date: Thu, 31 Aug 2023 09:40:52 +0800
In-Reply-To: <5c9ba378-2920-4892-bdf0-174e47d528b7@arm.com> (Ryan Roberts's
    message of "Wed, 30 Aug 2023 13:07:01 +0100")
Message-ID: <87cyz43s63.fsf@yhuang6-desk2.ccr.corp.intel.com>

Ryan Roberts writes:

> On 15/08/2023 22:32, Huang, Ying wrote:
>> Hi, Ryan,
>>
>> Ryan Roberts writes:
>>
>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
>>> allocated in large folios of a determined order. All pages of the large
>>> folio are pte-mapped during the same page fault, significantly reducing
>>> the number of page faults. The number of per-page operations (e.g. ref
>>> counting, rmap management, lru list management) is also significantly
>>> reduced since those ops now become per-folio.
>>>
>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
>>> which defaults to disabled for now; the long term aim is for this to
>>> default to enabled, but there are some risks around internal
>>> fragmentation that need to be better understood first.
>>>
>>> Large anonymous folio (LAF) allocation is integrated with the existing
>>> (PMD-order) THP and single (S) page allocation according to this policy,
>>> where fallback (>) is performed for various reasons, such as the
>>> proposed folio order not fitting within the bounds of the VMA, etc:
>>>
>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>>> ----------------|-----------|-------------|---------------|-------------
>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
>>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>
>> IMHO, we should use the following semantics as you have suggested
>> before.
>>
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | S           | LAF>S         | THP>LAF>S
>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>
>> Or even,
>>
>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
>> ----------------|-----------|-------------|---------------|-------------
>> no hint         | S         | S           | S             | THP>LAF>S
>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
>> MADV_NOHUGEPAGE | S         | S           | S             | S
>>
>> From the implementation point of view, PTE mapped PMD-sized THP has
>> almost no difference with LAF (just some small sized THP). It will be
>> confusing to distinguish them from the interface point of view.
>>
>> So, IMHO, the real difference is the policy. For example, prefer
>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
>> interface is used to specify system global policy.
>> In the long term, it can be something like below,
>>
>> never:      S               # disable all THP
>> madvise:                    # never by default, control via madvise()
>> always:     THP>LAF>S       # prefer PMD-sized THP in fact
>> small:      LAF>S           # prefer small sized THP
>> auto:                       # use in-kernel heuristics for THP size
>>
>> But it may be not ready to add new policies now. So, before the new
>> policies are ready, we can add a debugfs interface to override the
>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
>> we have tuned enough workloads, collected enough data, we can add new
>> policies to the sysfs interface.
>
> I think we can all imagine many policy options. But we don't really have much
> evidence yet for what is best. The policy I'm currently using is intended to
> give some flexibility for testing (use LAF without THP by setting sysfs=never,
> use THP without LAF by compiling without LAF) without adding any new knobs at
> all. Given that, surely we can defer these decisions until we have more data?
>
> In the absence of data, your proposed solution sounds very sensible to me. But
> for the purposes of scaling up perf testing, I don't think it's essential given
> the current policy will also produce the same options.
>
> If we were going to add a debugfs knob, I think the higher priority would be a
> knob to specify the folio order. (but again, I would rather avoid if possible).

I totally understand that we need some way to control PMD-sized THP and LAF
to tune the workload, and nobody likes a debugfs knob.

My concern about the interface is that we have no way to disable LAF
system-wide without rebuilding the kernel. In the future, should we add a new
policy to /sys/kernel/mm/transparent_hugepage/enabled that is stricter than
"never"? "really_never"?

--
Best Regards,
Huang, Ying
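
For reference, the three policy tables discussed in this thread can be
written down as a small lookup. This is only an illustrative sketch, not
kernel code: the function name fallback_chain, the variant labels "v5",
"alt1" and "alt2", and the string encodings are all hypothetical and do not
appear in the patch. "v5" stands for the table in the patch description,
"alt1" and "alt2" for the two alternatives suggested in the review.

```python
# Illustrative sketch only. All names here are hypothetical, chosen to
# mirror the tables in the thread; none of them come from the patch.
S, LAF, THP = "S", "LAF", "THP"

SYSFS_VALUES = ("never", "madvise", "always")
HINTS = (None, "MADV_HUGEPAGE", "MADV_NOHUGEPAGE")
VARIANTS = ("v5", "alt1", "alt2")


def fallback_chain(prctl_enabled, sysfs, hint=None, variant="v5"):
    """Return the allocation preference order, most preferred first.

    prctl_enabled: False corresponds to the "prctl=dis" column.
    sysfs:         "never", "madvise" or "always".
    hint:          None (no hint), "MADV_HUGEPAGE" or "MADV_NOHUGEPAGE".
    variant:       "v5"   = table from the patch description,
                   "alt1" = first alternative from the review,
                   "alt2" = second ("Or even,") alternative.
    """
    if sysfs not in SYSFS_VALUES:
        raise ValueError(f"unknown sysfs value: {sysfs!r}")
    if hint not in HINTS:
        raise ValueError(f"unknown hint: {hint!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")

    # prctl=dis column and MADV_NOHUGEPAGE row are "S" in every table.
    if not prctl_enabled or hint == "MADV_NOHUGEPAGE":
        return [S]

    # sysfs=always column is identical in all three tables.
    if sysfs == "always":
        return [THP, LAF, S]

    if sysfs == "madvise":
        if hint == "MADV_HUGEPAGE":
            return [THP, LAF, S]
        # No hint: only the second alternative falls back to single pages.
        return [S] if variant == "alt2" else [LAF, S]

    # sysfs=never: only the v5 table still allows LAF here.
    return [LAF, S] if variant == "v5" else [S]


if __name__ == "__main__":
    for variant in VARIANTS:
        chain = fallback_chain(True, "never", None, variant)
        print(variant, "sysfs=never, no hint:", ">".join(chain))
```

Read this way, the concern raised above is visible in the last branch: with
the v5 table, sysfs=never still yields LAF>S, so LAF can only be turned off
per process (prctl), per VMA (MADV_NOHUGEPAGE) or at build time.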