References: <20230703135330.1865927-1-ryan.roberts@arm.com> <20230703135330.1865927-5-ryan.roberts@arm.com> <22b23340-5fad-01be-6bac-fa30a3724b27@arm.com>
In-Reply-To: <22b23340-5fad-01be-6bac-fa30a3724b27@arm.com>
From: Yu Zhao
Date: Tue, 4 Jul 2023 17:47:20 -0600
Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", Yin Fengwei, David Hildenbrand, Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Tue, Jul 4, 2023 at 8:08 AM Ryan Roberts wrote:
>
> On 04/07/2023 02:35, Yu Zhao wrote:
> > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts wrote:
> >>
> >> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
> >> allocated in large folios of a specified order. All pages of the large
> >> folio are pte-mapped during the same page fault, significantly reducing
> >> the number of page faults. The number of per-page operations (e.g. ref
> >> counting, rmap management, lru list management) are also significantly
> >> reduced since those ops now become per-folio.
> >>
> >> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> >> defaults to disabled for now; there is a long list of todos to make
> >> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> >> madvise ops, etc). These items will be tackled in subsequent patches.
> >>
> >> When enabled, the preferred folio order is as returned by
> >> arch_wants_pte_order(), which may be overridden by the arch as it sees
> >> fit. Some architectures (e.g. arm64) can coalsece TLB entries if a
> >
> > coalesce
>
> ACK
>
> >
> >> contiguous set of ptes map physically contigious, naturally aligned
> >
> > contiguous
>
> ACK
>
> >
> >> memory, so this mechanism allows the architecture to optimize as
> >> required.
> >>
> >> If the preferred order can't be used (e.g. because the folio would
> >> breach the bounds of the vma, or because ptes in the region are already
> >> mapped) then we fall back to a suitable lower order.
> >>
> >> Signed-off-by: Ryan Roberts
> >> ---
> >>  mm/Kconfig  |  10 ++++
> >>  mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
> >>  2 files changed, 165 insertions(+), 13 deletions(-)
> >>
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 7672a22647b4..1c06b2c0a24e 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
> >>           support of file THPs will be developed in the next few release
> >>           cycles.
> >>
> >> +config FLEXIBLE_THP
> >> +       bool "Flexible order THP"
> >> +       depends on TRANSPARENT_HUGEPAGE
> >> +       default n
> >
> > The default value is already N.
>
> Is there a coding standard for this? Personally I prefer to make it explicit.
>
> >
> >> +       help
> >> +         Use large (bigger than order-0) folios to back anonymous memory where
> >> +         possible, even if the order of the folio is smaller than the PMD
> >> +         order. This reduces the number of page faults, as well as other
> >> +         per-page overheads to improve performance for many workloads.
> >> +
> >>  endif # TRANSPARENT_HUGEPAGE
> >>
> >>  #
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index fb30f7523550..abe2ea94f3f5 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
> >>         return 0;
> >>  }
> >>
> >> +#ifdef CONFIG_FLEXIBLE_THP
> >> +/*
> >> + * Allocates, zeros and returns a folio of the requested order for use as
> >> + * anonymous memory.
> >> + */
> >> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> >> +                                     unsigned long addr, int order)
> >> +{
> >> +       gfp_t gfp;
> >> +       struct folio *folio;
> >> +
> >> +       if (order == 0)
> >> +               return vma_alloc_zeroed_movable_folio(vma, addr);
> >> +
> >> +       gfp = vma_thp_gfp_mask(vma);
> >> +       folio = vma_alloc_folio(gfp, order, vma, addr, true);
> >> +       if (folio)
> >> +               clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> >> +
> >> +       return folio;
> >> +}
> >> +
> >> +/*
> >> + * Preferred folio order to allocate for anonymous memory.
> >> + */
> >> +#define max_anon_folio_order(vma)      arch_wants_pte_order(vma)
> >> +#else
> >> +#define alloc_anon_folio(vma, addr, order) \
> >> +                               vma_alloc_zeroed_movable_folio(vma, addr)
> >> +#define max_anon_folio_order(vma)      0
> >> +#endif
> >> +
> >> +/*
> >> + * Returns index of first pte that is not none, or nr if all are none.
> >> + */
> >> +static inline int check_ptes_none(pte_t *pte, int nr)
> >> +{
> >> +       int i;
> >> +
> >> +       for (i = 0; i < nr; i++) {
> >> +               if (!pte_none(ptep_get(pte++)))
> >> +                       return i;
> >> +       }
> >> +
> >> +       return nr;
> >> +}
> >> +
> >> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> >> +{
> >> +       /*
> >> +        * The aim here is to determine what size of folio we should allocate
> >> +        * for this fault. Factors include:
> >> +        * - Order must not be higher than `order` upon entry
> >> +        * - Folio must be naturally aligned within VA space
> >> +        * - Folio must be fully contained inside one pmd entry
> >> +        * - Folio must not breach boundaries of vma
> >> +        * - Folio must not overlap any non-none ptes
> >> +        *
> >> +        * Additionally, we do not allow order-1 since this breaks assumptions
> >> +        * elsewhere in the mm; THP pages must be at least order-2 (since they
> >> +        * store state up to the 3rd struct page subpage), and these pages must
> >> +        * be THP in order to correctly use pre-existing THP infrastructure such
> >> +        * as folio_split().
> >> +        *
> >> +        * Note that the caller may or may not choose to lock the pte. If
> >> +        * unlocked, the result is racy and the user must re-check any overlap
> >> +        * with non-none ptes under the lock.
> >> +        */
> >> +
> >> +       struct vm_area_struct *vma = vmf->vma;
> >> +       int nr;
> >> +       unsigned long addr;
> >> +       pte_t *pte;
> >> +       pte_t *first_set = NULL;
> >> +       int ret;
> >> +
> >> +       order = min(order, PMD_SHIFT - PAGE_SHIFT);
> >> +
> >> +       for (; order > 1; order--) {
> >
> > I'm not sure how we can justify this policy. As an initial step, it'd
> > be a lot easier to sell if we only considered the order of
> > arch_wants_pte_order() and the order 0.
>
> My justification is in the cover letter; I see performance regression (vs the
> unpatched kernel) when using the policy you suggest. This policy performs much
> better in my tests. (I'll reply directly to your follow-up questions in the
> cover letter shortly).
>
> What are your technical concerns about this approach? It is pretty light weight
> (I only touch each PTE once, regardless of the number of loops). If we have
> strong technical reasons for reverting to the less performant approach then
> fair enough, but I'd like to hear the rationale first.

Yes, mainly from three different angles:

1. The engineering principle: we'd want to separate the mechanical part
and the policy part when attacking something large. This way it'd be
easier to root cause any regressions if they happen. In our case,
assuming the regression is real, it might actually prove my point here:
I really don't think the two checks (whether a vma range fits and, if
it does, which is unlikely according to your description, whether all
64 PTEs are none) caused the regression. My theory is that 64KB itself
caused the regression, but smaller sizes made an improvement. If this
is really the case, I'd say the fallback policy masked the real
problem, which is that 64KB is too large to begin with.

2. The benchmark methodology: I appreciate your effort in doing it, but
we also need to consider that the setup is an uncommon scenario.
The common scenarios are devices that have been running for weeks
without reboots, generally having higher external fragmentation. In
addition, client devices are often under memory pressure, which makes
fragmentation worse. So we should take the result with a grain of salt,
and, for that matter, any results collected right after a fresh reboot.

3. The technical concern: an ideal policy would consider all three
major factors: the h/w features, userspace behaviors and the page
allocator behavior. So far we only have the first one handy. The second
one is too challenging, so let's forget about it for now. The third one
is why I really don't like this best-fit policy. By falling back to
smaller orders, we can waste the limited supply of physically
contiguous pages on the wrong vmas (small vmas), leading to failures to
serve large vmas which otherwise would have a higher overall ROI. This
can only be addressed within the page allocator: we need to enlighten
it to return the highest order available, i.e., not breaking up any
higher orders.

I'm not really saying we should never try this fallback policy. I'm
just thinking we can leave it for later, probably after we've addressed
all the concerns with basic functionality.
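
To make the two policies being debated concrete, below is a minimal,
standalone userspace sketch. It is illustrative only and not taken from
the patch: first_used(), pick_order_fallback() and
pick_order_all_or_nothing() are hypothetical names, and the constants
are assumptions (4K pages under a 2M PMD). It models one PMD's worth of
PTEs as a bool array, mirrors the order-selection loop of
calc_anon_folio_order_alloc() and its check_ptes_none() helper, and
compares it with the "preferred order or order-0" alternative raised in
the review. The VMA-bounds, PMD-containment and pte-locking checks done
by the real code are deliberately omitted.

/*
 * Illustrative userspace simulation only (not kernel code, not part of
 * the patch). "used" means the PTE slot is already mapped (non-none).
 */
#include <stdbool.h>
#include <stdio.h>

#define PTES_PER_PMD 512                /* 4K pages under a 2M PMD; assumption */

static bool pte_used[PTES_PER_PMD];

/* Mirrors check_ptes_none(): index of the first used slot, or nr if all free. */
static int first_used(const bool *ptes, int nr)
{
        for (int i = 0; i < nr; i++) {
                if (ptes[i])
                        return i;
        }
        return nr;
}

/*
 * Fallback policy as in the patch: the largest order <= preferred whose
 * naturally aligned block around fault_idx is entirely unmapped. Order-1
 * is skipped, so exhausting the loop means order-0.
 */
static int pick_order_fallback(int fault_idx, int preferred)
{
        for (int order = preferred; order > 1; order--) {
                int nr = 1 << order;
                int start = fault_idx & ~(nr - 1);      /* natural alignment */

                if (first_used(&pte_used[start], nr) == nr)
                        return order;
        }
        return 0;
}

/* Alternative raised in the review: the preferred order or order-0, nothing between. */
static int pick_order_all_or_nothing(int fault_idx, int preferred)
{
        int nr = 1 << preferred;
        int start = fault_idx & ~(nr - 1);

        return first_used(&pte_used[start], nr) == nr ? preferred : 0;
}

int main(void)
{
        pte_used[132] = true;   /* one neighbouring PTE is already mapped */

        /* Fault at slot 128 with a preferred order of 4 (16 slots, i.e. 64KB). */
        printf("fallback policy:       order %d\n", pick_order_fallback(128, 4));
        printf("all-or-nothing policy: order %d\n", pick_order_all_or_nothing(128, 4));
        return 0;
}

Compiled and run, the fallback version still obtains an order-2 folio
for a fault whose 64KB-aligned neighbourhood is partially populated,
while the all-or-nothing version degrades to order-0. That per-fault
difference is the gap Ryan's numbers reflect; the allocator-level cost
Yu raises in point 3 (spending scarce contiguous blocks on small
mappings) does not show up in this per-fault view.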