From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yu Zhao <yuzhao@google.com>
Date: Thu, 3 Aug 2023 17:50:38 -0600
Subject: Re: [PATCH v4 2/5] mm: LARGE_ANON_FOLIO for improved performance
To: Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, Yin Fengwei, David Hildenbrand,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi, "Huang, Ying",
 Zi Yan, Luis Chamberlain, Itaru Kitayama, "Kirill A. Shutemov",
 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org
References: <20230726095146.2826796-1-ryan.roberts@arm.com>
 <20230726095146.2826796-3-ryan.roberts@arm.com>

On Thu, Aug 3, 2023 at 6:43 AM Ryan Roberts wrote:
>
> + Kirill
>
> On 26/07/2023 10:51, Ryan Roberts wrote:
> > Introduce the LARGE_ANON_FOLIO feature, which allows anonymous memory
> > to be allocated in large folios of a determined order. All pages of
> > the large folio are pte-mapped during the same page fault,
> > significantly reducing the number of page faults. The number of
> > per-page operations (e.g. ref counting, rmap management, lru list
> > management) is also significantly reduced since those ops now become
> > per-folio.
> >
> > The new behaviour is hidden behind the new LARGE_ANON_FOLIO Kconfig,
> > which defaults to disabled for now; the long-term aim is for this to
> > default to enabled, but there are some risks around internal
> > fragmentation that need to be better understood first.
> >
> > When enabled, the folio order is determined as follows: for a vma,
> > process or system that has explicitly disabled THP, we continue to
> > allocate order-0. THP is most likely disabled to avoid any possible
> > internal fragmentation, so we honour that request.
> >
> > Otherwise, the return value of arch_wants_pte_order() is used. For
> > vmas that have not explicitly opted in to using transparent hugepages
> > (e.g. where thp=madvise and the vma does not have MADV_HUGEPAGE),
> > arch_wants_pte_order() is limited to 64K (or PAGE_SIZE, whichever is
> > bigger). This allows for a performance boost without requiring any
> > explicit opt-in from the workload while limiting internal
> > fragmentation.
> >
> > If the preferred order can't be used (e.g. because the folio would
> > breach the bounds of the vma, or because ptes in the region are
> > already mapped), then we fall back to a suitable lower order; first
> > PAGE_ALLOC_COSTLY_ORDER, then order-0.
> >
> > ...
> >
> > +#define ANON_FOLIO_MAX_ORDER_UNHINTED \
> > +		(ilog2(max_t(unsigned long, SZ_64K, PAGE_SIZE)) - PAGE_SHIFT)
> > +
> > +static int anon_folio_order(struct vm_area_struct *vma)
> > +{
> > +	int order;
> > +
> > +	/*
> > +	 * If THP is explicitly disabled for either the vma, the process or
> > +	 * the system, then this is very likely intended to limit internal
> > +	 * fragmentation; in this case, don't attempt to allocate a large
> > +	 * anonymous folio.
> > +	 *
> > +	 * Else, if the vma is eligible for thp, allocate a large folio of
> > +	 * the size preferred by the arch. Or if the arch requested a very
> > +	 * small size or didn't request a size, then use
> > +	 * PAGE_ALLOC_COSTLY_ORDER, which still meets the arch's
> > +	 * requirements but means we still take advantage of SW
> > +	 * optimizations (e.g. fewer page faults).
> > +	 *
> > +	 * Finally, if thp is enabled but the vma isn't eligible, take the
> > +	 * arch-preferred size and limit it to ANON_FOLIO_MAX_ORDER_UNHINTED.
> > +	 * This ensures workloads that have not explicitly opted in still
> > +	 * benefit while capping the potential for internal fragmentation.
> > +	 */
> > +
> > +	if ((vma->vm_flags & VM_NOHUGEPAGE) ||
> > +	    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags) ||
> > +	    !hugepage_flags_enabled())
> > +		order = 0;
> > +	else {
> > +		order = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER);
> > +
> > +		if (!hugepage_vma_check(vma, vma->vm_flags, false, true, true))
> > +			order = min(order, ANON_FOLIO_MAX_ORDER_UNHINTED);
> > +	}
> > +
> > +	return order;
> > +}
>
>
> Hi All,
>
> I'm writing up the conclusions that we arrived at during discussion in
> the THP meeting yesterday, regarding linkage with existing THP ABIs. It
> would be great if I could get an explicit "agree" or "disagree" +
> rationale from at least David, Yu and Kirill.
>
> In summary: I think we are converging on the approach that is already
> coded, but I'd like confirmation.
>
>
> The THP situation today
> -----------------------
>
> - At system level: THP can be set to "never", "madvise" or "always"
> - At process level: THP can be "never" or "defer to system setting"
> - At VMA level: no hint, MADV_HUGEPAGE or MADV_NOHUGEPAGE
>
> That gives us this table to describe how a page fault is handled,
> according to process state (columns) and vma flags (rows):
>
>                 | never     | madvise   | always
> ----------------|-----------|-----------|-----------
> no hint         | S         | S         | THP>S
> MADV_HUGEPAGE   | S         | THP>S     | THP>S
> MADV_NOHUGEPAGE | S         | S         | S
>
> Legend:
> S	allocate single page (PTE-mapped)
> LAF	allocate large anon folio (PTE-mapped)
> THP	allocate THP-sized folio (PMD-mapped)
> >	fallback (usually because vma size/alignment insufficient for folio)
>
>
> Principles for Large Anon Folios (LAF)
> --------------------------------------
>
> David tells us there are use cases today (e.g. qemu live migration)
> that use MADV_NOHUGEPAGE to mean "don't fill any PTEs that are not
> explicitly faulted", and these use cases will break (i.e. become
> functionally incorrect) if this request is not honoured.

I don't remember David saying this. I think he was referring to UFFD,
not MADV_NOHUGEPAGE, when discussing what we need to absolutely
respect.
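
The fault-handling table quoted above maps onto a small decision
function. Below is a minimal userspace C sketch of just that mapping,
not kernel code: the enum and function names here are hypothetical, and
it encodes only the table as quoted (no LAF rows).

#include <assert.h>
#include <stdio.h>

/* Hypothetical names; the kernel encodes these states differently. */
enum sys_thp  { SYS_NEVER, SYS_MADVISE, SYS_ALWAYS };
enum vma_hint { NO_HINT, HINT_HUGEPAGE, HINT_NOHUGEPAGE };
enum alloc    { ALLOC_S, ALLOC_THP };	/* THP>S: try THP, fall back to S */

static enum alloc fault_alloc(enum sys_thp sys, enum vma_hint hint)
{
	/* The MADV_NOHUGEPAGE row and the "never" column are all S. */
	if (hint == HINT_NOHUGEPAGE || sys == SYS_NEVER)
		return ALLOC_S;
	/* The "always" column is THP>S for the remaining rows. */
	if (sys == SYS_ALWAYS)
		return ALLOC_THP;
	/* The "madvise" column: only MADV_HUGEPAGE vmas get THP>S. */
	return hint == HINT_HUGEPAGE ? ALLOC_THP : ALLOC_S;
}

int main(void)
{
	assert(fault_alloc(SYS_NEVER, HINT_HUGEPAGE) == ALLOC_S);
	assert(fault_alloc(SYS_MADVISE, NO_HINT) == ALLOC_S);
	assert(fault_alloc(SYS_MADVISE, HINT_HUGEPAGE) == ALLOC_THP);
	assert(fault_alloc(SYS_ALWAYS, NO_HINT) == ALLOC_THP);
	puts("table matches");
	return 0;
}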
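
The fallback chain in the commit message (preferred order, then
PAGE_ALLOC_COSTLY_ORDER, then order-0) can be sketched the same way.
In this sketch, order_usable() is a hypothetical stand-in for the real
checks (vma bounds, already-mapped ptes), and COSTLY_ORDER assumes
PAGE_ALLOC_COSTLY_ORDER's usual value of 3.

#include <stdbool.h>
#include <stdio.h>

#define COSTLY_ORDER 3	/* PAGE_ALLOC_COSTLY_ORDER on common configs */

/* Hypothetical stand-in: here, an order is usable up to some maximum. */
static bool order_usable(int order, int max_usable)
{
	return order <= max_usable;
}

/* Try preferred, then COSTLY_ORDER, then order-0, as described above. */
static int fallback_order(int preferred, int max_usable)
{
	const int candidates[] = { preferred, COSTLY_ORDER, 0 };

	for (unsigned i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
		if (order_usable(candidates[i], max_usable))
			return candidates[i];
	return 0;
}

int main(void)
{
	printf("%d\n", fallback_order(4, 5));	/* 4: preferred order fits */
	printf("%d\n", fallback_order(4, 3));	/* 3: costly-order fallback */
	printf("%d\n", fallback_order(4, 0));	/* 0: order-0 fallback */
	return 0;
}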