From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yu Zhao <yuzhao@google.com>
Date: Thu, 3 Aug 2023 17:50:38 -0600
Subject: Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, "Huang, Ying",
 Zi Yan, Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov",
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
References: <20230726095146.2826796-1-ryan.roberts@arm.com>
 <20230726095146.2826796-3-ryan.roberts@arm.com>

On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts wrote:
>
> + Kirill
>
> On 26/07/2023 10:51, Ryan Roberts wrote:
> > Introduce the LARGE_ANON_FOLIO feature, which allows anonymous memory
> > to be allocated in large folios of a determined order. All pages of
> > the large folio are pte-mapped during the same page fault,
> > significantly reducing the number of page faults. The number of
> > per-page operations (e.g. ref counting, rmap management, lru list
> > management) is also significantly reduced since those ops now become
> > per-folio.
> >
> > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> > which defaults to disabled for now; the long-term aim is for this to
> > default to enabled, but there are some risks around internal
> > fragmentation that need to be better understood first.
> >
> > When enabled, the folio order is determined as follows: for a vma,
> > process or system that has explicitly disabled THP, we continue to
> > allocate order-0. THP is most likely disabled to avoid any possible
> > internal fragmentation, so we honour that request.
> >
> > Otherwise, the return value of arch_wants_pte_order() is used. For
> > vmas that have not explicitly opted in to using transparent hugepages
> > (e.g. where thp=madvise and the vma does not have MADV_HUGEPAGE),
> > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> > bigger). This allows for a performance boost without requiring any
> > explicit opt-in from the workload while limiting internal
> > fragmentation.
> >
> > If the preferred order can't be used (e.g. because the folio would
> > breach the bounds of the vma, or because ptes in the region are
> > already mapped), then we fall back to a suitable lower order; first
> > PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >
> > ...
> >
> > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> > +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> > +
> > +static int anon_folio_order(struct vm_area_struct *vma)
> > +{
> > +	int order;
> > +
> > +	/*
> > +	 * If THP is explicitly disabled for either the vma, the process or
> > +	 * the system, then this is very likely intended to limit internal
> > +	 * fragmentation; in this case, don't attempt to allocate a large
> > +	 * anonymous folio.
> > +	 *
> > +	 * Else, if the vma is eligible for thp, allocate a large folio of
> > +	 * the size preferred by the arch. Or if the arch requested a very
> > +	 * small size or didn't request a size, then use
> > +	 * PAGE_ALLOC_COSTLY_ORDER, which still meets the arch's
> > +	 * requirements but means we still take advantage of SW
> > +	 * optimizations (e.g. fewer page faults).
> > +	 *
> > +	 * Finally, if thp is enabled but the vma isn't eligible, take the
> > +	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> > +	 * This ensures workloads that have not explicitly opted in still
> > +	 * benefit while capping the potential for internal fragmentation.
> > +	 */
> > +
> > +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> > +	    !hugepage_flags_enabled())
> > +		order = 0;
> > +	else {
> > +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > +
> > +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> > +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> > +	}
> > +
> > +	return order;
> > +}
>
>
> Hi All,
>
> I'm writing up the conclusions that we arrived at during discussion in
> the THP meeting yesterday, regarding linkage with existing THP ABIs. It
> would be great if I could get an explicit "agree" or "disagree" +
> rationale from at least David, Yu and Kirill.
>
> In summary: I think we are converging on the approach that is already
> coded, but I'd like confirmation.
>
>
> The THP situation today
> -----------------------
>
> - At system level: THP can be set to "never", "madvise" or "always"
> - At process level: THP can be "never" or "defer to system setting"
> - At VMA level: no hint, MADV_HUGEPAGE or MADV_NOHUGEPAGE
>
> That gives us this table to describe how a page fault is handled,
> according to process state (columns) and vma flags (rows):
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | S         | THP>S
> MADV_HUGEPAGE   | S         | THP>S     | THP>S
> MADV_NOHUGEPAGE | S         | S         | S
>
> Legend:
> S	allocate single page (PTE-mapped)
> LAF	allocate large anon folio (PTE-mapped)
> THP	allocate THP-sized folio (PMD-mapped)
> >	fallback (usually because vma size/alignment insufficient for folio)
>
>
> Principles for Large Anon Folios (LAF)
> --------------------------------------
>
> David tells us there are use cases today (e.g. qemu live migration)
> that use MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not
> explicitly faulted", and these use cases will break (i.e. become
> functionally incorrect) if this request is not honoured.

I don't remember David saying this. I think he was referring to UFFD,
not MADV_NOHUGEPAGE, when discussing what we need to absolutely
respect.
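
The fault-handling table quoted above maps onto a small decision
function. Below is a minimal userspace C sketch of just that mapping,
not kernel code: the enum and function names here are hypothetical, and
it encodes only the table as quoted (no LAF rows).

#include <assert.h>
#include <stdio.h>

/* Hypothetical names; the kernel encodes these states differently. */
enum sys_thp  { SYS_NEVER, SYS_MADVISE, SYS_ALWAYS };
enum vma_hint { NO_HINT, HINT_HUGEPAGE, HINT_NOHUGEPAGE };
enum alloc    { ALLOC_S, ALLOC_THP };	/* THP>S: try THP, fall back to S */

static enum alloc fault_alloc(enum sys_thp sys, enum vma_hint hint)
{
	/* The MADV_NOHUGEPAGE row and the "never" column are all S. */
	if (hint == HINT_NOHUGEPAGE || sys == SYS_NEVER)
		return ALLOC_S;
	/* The "always" column is THP>S for the remaining rows. */
	if (sys == SYS_ALWAYS)
		return ALLOC_THP;
	/* The "madvise" column: only MADV_HUGEPAGE vmas get THP>S. */
	return hint == HINT_HUGEPAGE ? ALLOC_THP : ALLOC_S;
}

int main(void)
{
	assert(fault_alloc(SYS_NEVER, HINT_HUGEPAGE) == ALLOC_S);
	assert(fault_alloc(SYS_MADVISE, NO_HINT) == ALLOC_S);
	assert(fault_alloc(SYS_MADVISE, HINT_HUGEPAGE) == ALLOC_THP);
	assert(fault_alloc(SYS_ALWAYS, NO_HINT) == ALLOC_THP);
	puts("table matches");
	return 0;
}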
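
The fallback chain in the commit message (preferred order, then
PAGE_ALLOC_COSTLY_ORDER, then order-0) can be sketched the same way.
In this sketch, order_usable() is a hypothetical stand-in for the real
checks (vma bounds, already-mapped ptes), and COSTLY_ORDER assumes
PAGE_ALLOC_COSTLY_ORDER's usual value of 3.

#include <stdbool.h>
#include <stdio.h>

#define COSTLY_ORDER 3	/* PAGE_ALLOC_COSTLY_ORDER on common configs */

/* Hypothetical stand-in: here, an order is usable up to some maximum. */
static bool order_usable(int order, int max_usable)
{
	return order <= max_usable;
}

/* Try preferred, then COSTLY_ORDER, then order-0, as described above. */
static int fallback_order(int preferred, int max_usable)
{
	const int candidates[] = { preferred, COSTLY_ORDER, 0 };

	for (unsigned i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
		if (order_usable(candidates[i], max_usable))
			return candidates[i];
	return 0;
}

int main(void)
{
	printf("%d\n", fallback_order(4, 5));	/* 4: preferred order fits */
	printf("%d\n", fallback_order(4, 3));	/* 3: costly-order fallback */
	printf("%d\n", fallback_order(4, 0));	/* 0: order-0 fallback */
	return 0;
}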