From: Yang Shi
Date: Thu, 31 Aug 2023 10:15:09 -0700
Subject: Re: [PATCH v5 3/5] mm: LARGE_ANON_FOLIO for improved performance
To: David Hildenbrand
Cc: "Huang, Ying", Ryan Roberts, Andrew Morton, Matthew Wilcox,
 Yin Fengwei, Yu Zhao, Catalin Marinas, Anshuman Khandual, Zi Yan,
 Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov",
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
In-Reply-To: <4e14730b-4e4c-de30-04bb-9f3ec4a93754@redhat.com>
References: <20230810142942.3169679-1-ryan.roberts@arm.com>
 <20230810142942.3169679-4-ryan.roberts@arm.com>
 <87v8dg6lfu.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <5c9ba378-2920-4892-bdf0-174e47d528b7@arm.com>
 <87cyz43s63.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <4e14730b-4e4c-de30-04bb-9f3ec4a93754@redhat.com>

On Thu, Aug 31, 2023 at 12:57 AM David Hildenbrand wrote:
>
> On 31.08.23 03:40, Huang, Ying wrote:
> > Ryan Roberts writes:
> >
> >> On 15/08/2023 22:32, Huang, Ying wrote:
> >>> Hi, Ryan,
> >>>
> >>> Ryan Roberts writes:
> >>>
> >>>> Introduce LARGE_ANON_FOLIO feature, which allows anonymous memory to be
> >>>> allocated in large folios of a determined order. All pages of the large
> >>>> folio are pte-mapped during the same page fault, significantly reducing
> >>>> the number of page faults. The number of per-page operations (e.g. ref
> >>>> counting, rmap management, lru list management) are also significantly
> >>>> reduced since those ops now become per-folio.
> >>>>
> >>>> The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> >>>> which defaults to disabled for now; The long term aim is for this to
> >>>> default to enabled, but there are some risks around internal
> >>>> fragmentation that need to be better understood first.
> >>>>
> >>>> Large anonymous folio (LAF) allocation is integrated with the existing
> >>>> (PMD-order) THP and single (S) page allocation according to this policy,
> >>>> where fallback (>) is performed for various reasons, such as the
> >>>> proposed folio order not fitting within the bounds of the VMA, etc:
> >>>>
> >>>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>>> ----------------|-----------|-------------|---------------|-------------
> >>>> no hint         | S         | LAF>S       | LAF>S         | THP>LAF>S
> >>>> MADV_HUGEPAGE   | S         | LAF>S       | THP>LAF>S     | THP>LAF>S
> >>>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> IMHO, we should use the following semantics as you have suggested
> >>> before.
> >>>
> >>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | LAF>S         | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> Or even,
> >>>
> >>>                 | prctl=dis | prctl=ena   | prctl=ena     | prctl=ena
> >>>                 | sysfs=X   | sysfs=never | sysfs=madvise | sysfs=always
> >>> ----------------|-----------|-------------|---------------|-------------
> >>> no hint         | S         | S           | S             | THP>LAF>S
> >>> MADV_HUGEPAGE   | S         | S           | THP>LAF>S     | THP>LAF>S
> >>> MADV_NOHUGEPAGE | S         | S           | S             | S
> >>>
> >>> From the implementation point of view, PTE mapped PMD-sized THP has
> >>> almost no difference with LAF (just some small sized THP). It will be
> >>> confusing to distinguish them from the interface point of view.
> >>>
> >>> So, IMHO, the real difference is the policy. For example, prefer
> >>> PMD-sized THP, prefer small sized THP, or fully auto. The sysfs
> >>> interface is used to specify system global policy. In the long term, it
> >>> can be something like below,
> >>>
> >>> never:   S          # disable all THP
> >>> madvise:            # never by default, control via madvise()
> >>> always:  THP>LAF>S  # prefer PMD-sized THP in fact
> >>> small:   LAF>S      # prefer small sized THP
> >>> auto:               # use in-kernel heuristics for THP size
> >>>
> >>> But it may be not ready to add new policies now. So, before the new
> >>> policies are ready, we can add a debugfs interface to override the
> >>> original policy in /sys/kernel/mm/transparent_hugepage/enabled. After
> >>> we have tuned enough workloads, collected enough data, we can add new
> >>> policies to the sysfs interface.
> >>
> >> I think we can all imagine many policy options. But we don't really have much
> >> evidence yet for what is best.
> >> The policy I'm currently using is intended to
> >> give some flexibility for testing (use LAF without THP by setting sysfs=never,
> >> use THP without LAF by compiling without LAF) without adding any new knobs at
> >> all. Given that, surely we can defer these decisions until we have more data?
> >>
> >> In the absence of data, your proposed solution sounds very sensible to me. But
> >> for the purposes of scaling up perf testing, I don't think it's essential given
> >> the current policy will also produce the same options.
> >>
> >> If we were going to add a debugfs knob, I think the higher priority would be a
> >> knob to specify the folio order. (but again, I would rather avoid if possible).
> >
> > I totally understand we need some way to control PMD-sized THP and LAF
> > to tune the workload, and nobody likes debugfs knob.
> >
> > My concern about interface is that we have no way to disable LAF
> > system-wise without rebuilding the kernel. In the future, should we add
> > a new policy to /sys/kernel/mm/transparent_hugepage/enabled to be
> > stricter than "never"? "really_never"?
>
> Let's talk about that in a bi-weekly MM session. (I proposed it as a
> topic for next week).
>
> As raised in another mail, we can then discuss
> * how we want to call this feature (transparent large pages? there is
>   the concern that "THP" might confuse users. Maybe we can consider
>   "large" the more generic version and "huge" only PMD-size, TBD)

I tend to agree. "Huge" means PMD-mappable (transparent or HugeTLB),
"Large" means any order below the PMD-mappable order, and "Gigantic"
means PUD-mappable. This should cause the least confusion IMHO.

> * how to expose it in stats towards the user (e.g., /proc/meminfo)

I recall suggesting new statistics for each order, but that was NAK'ed.
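
The naming proposed above can be written down as a small classifier. This is
only a sketch to make the terms concrete, not kernel code: the names
classify_order(), enum folio_class and the SKETCH_* constants are made up for
illustration, the orders assume 4 KiB base pages on x86-64 (PMD = order 9,
PUD = order 18), and treating order 0 as a plain single page is my assumption.

```c
#include <assert.h>

/* Assumed orders for illustration only: 4 KiB base pages on x86-64,
 * where a PMD maps 2 MiB (order 9) and a PUD maps 1 GiB (order 18). */
#define SKETCH_PMD_ORDER 9
#define SKETCH_PUD_ORDER 18

/* Hypothetical names; none of these exist in the kernel. */
enum folio_class {
	FOLIO_SINGLE,	/* order 0, an ordinary base page */
	FOLIO_LARGE,	/* any order above 0 but below the PMD order */
	FOLIO_HUGE,	/* PMD-mappable (transparent or HugeTLB) */
	FOLIO_GIGANTIC,	/* PUD-mappable */
	FOLIO_OTHER	/* orders the proposed naming does not cover */
};

enum folio_class classify_order(int order)
{
	assert(order >= 0);

	if (order == 0)
		return FOLIO_SINGLE;
	if (order < SKETCH_PMD_ORDER)
		return FOLIO_LARGE;
	if (order == SKETCH_PMD_ORDER)
		return FOLIO_HUGE;
	if (order == SKETCH_PUD_ORDER)
		return FOLIO_GIGANTIC;
	return FOLIO_OTHER;
}
```

With these assumptions a 64 KiB folio (order 4) is "large", a 2 MiB folio is
"huge" and a 1 GiB folio is "gigantic". Orders between PMD and PUD are left
unnamed because the proposal does not say what to call them.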
> * which minimal toggles we want
>
> I think there *really* has to be a way to disable it for a running
> system, otherwise no distro will dare pulling it in, even after we
> figured out the other stuff.

TBH I really don't like tying large folios to the THP toggles. THP
(PMD-mappable) is just a special case of LAF. Ideally, a large folio
should be tried whenever possible. But I do agree we may not be able
to achieve the ideal case for the time being, and I also understand
the concern about regressions in early adoption, so a knob that can
disable large folios may be needed for now. But it should be just a
simple binary knob (on/off), and should not be part of the kernel ABI
(temporary and debugging only) IMHO.

One more thing we may discuss is whether the huge page madvise APIs
should take effect for large folios or not.

>
> Note that for the pagecache, large folios can be disabled and
> distributions are actively making use of that.
>
> --
> Cheers,
>
> David / dhildenb
>
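
For reference, the first table quoted in this mail (the policy from the v5
patch description) can be read as a plain lookup from the three inputs to a
fallback chain. The sketch below only restates that table; the enum and
function names are hypothetical and do not correspond to anything in the
patch or the kernel.

```c
#include <assert.h>

/* Hypothetical names used only to restate the quoted v5 table. */
enum prctl_state { PRCTL_DISABLED, PRCTL_ENABLED };
enum sysfs_mode { SYSFS_NEVER, SYSFS_MADVISE, SYSFS_ALWAYS };
enum vma_hint { HINT_NONE, HINT_HUGEPAGE, HINT_NOHUGEPAGE };

/* Fallback chains as written in the table: ">" means "fall back to". */
enum fallback_chain {
	CHAIN_S,		/* S */
	CHAIN_LAF_S,		/* LAF>S */
	CHAIN_THP_LAF_S		/* THP>LAF>S */
};

enum fallback_chain v5_fallback_chain(enum prctl_state prctl_state,
				      enum sysfs_mode sysfs,
				      enum vma_hint hint)
{
	assert(sysfs == SYSFS_NEVER || sysfs == SYSFS_MADVISE ||
	       sysfs == SYSFS_ALWAYS);

	/* prctl=dis column and MADV_NOHUGEPAGE row: single pages only. */
	if (prctl_state == PRCTL_DISABLED || hint == HINT_NOHUGEPAGE)
		return CHAIN_S;

	/* sysfs=always column: PMD-sized THP is tried first. */
	if (sysfs == SYSFS_ALWAYS)
		return CHAIN_THP_LAF_S;

	/* sysfs=madvise column: THP first only with MADV_HUGEPAGE. */
	if (sysfs == SYSFS_MADVISE && hint == HINT_HUGEPAGE)
		return CHAIN_THP_LAF_S;

	/* Remaining cells, including sysfs=never: LAF, then single page. */
	return CHAIN_LAF_S;
}
```

Written this way it is easy to see the point raised in the thread: with
prctl enabled, no value of the sysfs setting yields plain "S" unless the VMA
carries MADV_NOHUGEPAGE, so LAF cannot be switched off system-wide at
runtime under this table.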