From: Barry Song <21cnbao@gmail.com>
To: Ryan Roberts
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, willy@infradead.org,
 david@redhat.com, cerasuolodomenico@gmail.com, kasong@tencent.com,
 surenb@google.com, v-songbaohua@oppo.com, yosryahmed@google.com,
 yuzhao@google.com, chrisl@kernel.org, peterx@redhat.com
Subject: Re: [PATCH v2] mm: add per-order mTHP alloc_success and alloc_fail counters
Date: Tue, 2 Apr 2024 22:40:38 +1300
References: <20240328095139.143374-1-21cnbao@gmail.com>

On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts wrote:
>
> On 28/03/2024 09:51, Barry Song wrote:
> > From: Barry Song
> >
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. Presenting the
> > success rate of mTHP allocations appears to be a pressing need.
> >
> > Recently, I've been experiencing significant difficulty debugging
> > performance improvements and regressions without these figures.
> > It's crucial for us to understand the true effectiveness of
> > mTHP in real-world scenarios, especially in systems with
> > fragmented memory.
> >
> > This patch sets up the framework for per-order mTHP counters,
> > starting with the introduction of alloc_success and alloc_fail
> > counters. Incorporating additional counters should now be
> > straightforward as well.
> >
> > The initial two unsigned longs for each event are unused, given
> > that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity.
>
> Overall, this approach looks good to me - thanks for doing this!

Thank you for reviewing!

> >
> > Signed-off-by: Barry Song
> > ---
> > -v2:
> >  * move to sysfs and provide per-order counters; David, Ryan, Willy
> > -v1:
> >  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >
> >  include/linux/huge_mm.h | 17 +++++++++++++
> >  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c             |  3 +++
> >  3 files changed, 74 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e896ca4760f6..27fa26a22a8f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                       enforce_sysfs, orders);
> >  }
> >
> > +enum thp_event_item {
> > +     THP_ALLOC_SUCCESS,
> > +     THP_ALLOC_FAIL,
> > +     NR_THP_EVENT_ITEMS
> > +};
> > +
> > +struct thp_event_state {
> > +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> > +};
> > +
> > +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> > +
> > +static inline void count_thp_event(int order, enum thp_event_item item)
>
> We previously concluded that we will likely want a "currently allocated THPs"
> counter, which will need both increment and decrement. So perhaps

I intend to handle this via incremental patches, but it's unlikely to happen
this month due to my packed schedule with multiple tasks. Presently, my top
priority regarding counting is the fallback ratio for allocations. Following
that, it would likely be the SWAP allocation and fallback ratio, to determine
whether the swapfile is highly fragmented.

> "thp_event_inc()" (and eventual "thp_event_dec()") are more appropriate? And
> perhaps "event" isn't the right description either, since it implies always
> counting upwards (to me at least). Although I guess you have borrowed from the
> vmstat pattern? That file has some instantaneous counters, so how does it
> decrement?
>

Counting represents just one aspect of vmstat. In this scenario, the values
are consistently increasing. For metrics that can fluctuate in either
direction, an entirely different mechanism is employed: those are typically
managed through global atomic operations.

/*
 * Zone and node-based page accounting with per cpu differentials.
 */
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];

void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
void __dec_zone_page_state(struct page *, enum zone_stat_item);

void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long);
void __inc_node_page_state(struct page *, enum node_stat_item);
void __dec_node_page_state(struct page *, enum node_stat_item);

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
}

Counting is likely most efficiently handled with per-CPU copies and a
summation function to aggregate values from all CPUs.
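To make the contrast concrete: if we later add a "currently allocated THPs"
counter, I'd expect it to follow the atomic pattern rather than the per-CPU
event pattern. A rough, untested sketch -- every name below is invented for
illustration and is not part of this patch:

#include <linux/atomic.h>

/*
 * Hypothetical bidirectional counter: one global atomic per order.
 * Unlike the per-CPU events, this one can also be decremented, e.g.
 * when a large folio is freed or split.
 */
static atomic_long_t thp_nr_allocated[PMD_ORDER + 1];

static inline void thp_nr_allocated_inc(int order)
{
        atomic_long_inc(&thp_nr_allocated[order]);
}

static inline void thp_nr_allocated_dec(int order)
{
        atomic_long_dec(&thp_nr_allocated[order]);
}

static inline long thp_nr_allocated_read(int order)
{
        return atomic_long_read(&thp_nr_allocated[order]);
}

The events in this patch never need to decrement, so the cheap per-CPU
increment plus a sum over all CPUs at read time is sufficient.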
> > +{
> > +     this_cpu_inc(thp_event_states.event[order][item]);
>
> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save
> 16 bytes per CPU. (8K in a system with 512 CPUs). Possibly with an
> order_index() helper, since the *_show() functions will need this too.

I actually put some words in the commit message:

"The initial two unsigned longs for each event are unused, given
that order-0 and order-1 are not mTHP. Nonetheless, this refinement
improves code clarity."

However, if conserving memory is a higher priority, I can use
PMD_ORDER - 1 in version 3.
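If I do switch in v3, I imagine roughly the following relative to the hunk
above -- only a sketch of Ryan's suggestion, with order_index() as a
hypothetical helper:

/* mTHP orders run from 2 to PMD_ORDER, so order 2 maps to slot 0. */
static inline int order_index(int order)
{
        return order - 2;
}

struct thp_event_state {
        unsigned long event[PMD_ORDER - 1][NR_THP_EVENT_ITEMS];
};

static inline void count_thp_event(int order, enum thp_event_item item)
{
        this_cpu_inc(thp_event_states.event[order_index(order)][item]);
}

sum_thp_events() and the *_show() callbacks would then go through the same
helper, so the order-to-slot mapping lives in one place.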
> > +}
> > +
> >  #define transparent_hugepage_use_zero_page()                         \
> >       (transparent_hugepage_flags &                                   \
> >        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 1683de78c313..addd093d8410 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
> >       .sysfs_ops = &kobj_sysfs_ops,
> >  };
> >
> > +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
> > +
> > +static unsigned long sum_thp_events(int order, enum thp_event_item item)
> > +{
> > +     unsigned long sum = 0;
> > +     int cpu;
> > +
> > +     for_each_online_cpu(cpu) {
> > +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> > +
> > +             sum += this->event[order][item];
> > +     }
> > +
> > +     return sum;
> > +}
> > +
> > +static ssize_t alloc_success_show(struct kobject *kobj,
> > +                               struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
> > +}
> > +
> > +static ssize_t alloc_fail_show(struct kobject *kobj,
> > +                             struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
> > +}
> > +
> > +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
> > +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
> > +
> > +static struct attribute *stats_attrs[] = {
> > +     &alloc_success_attr.attr,
> > +     &alloc_fail_attr.attr,
> > +     NULL,
> > +};
> > +
> > +static struct attribute_group stats_attr_group = {
> > +     .name = "stats",
> > +     .attrs = stats_attrs,
> > +};
> > +
> >  static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >  {
> >       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> > @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >               return ERR_PTR(ret);
> >       }
> >
> > +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> > +     if (ret) {
> > +             kobject_put(&thpsize->kobj);
> > +             return ERR_PTR(ret);
> > +     }
> > +
> >       thpsize->order = order;
> >       return thpsize;
> >  }
> > @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
> >       if (unlikely(!folio)) {
> >               count_vm_event(THP_FAULT_FALLBACK);
> > +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
> >               return VM_FAULT_FALLBACK;
> >       }
> > +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
> >       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
> >  }
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 984b138f85b4..c9c1031c2ecb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >                       }
> >                       folio_throttle_swaprate(folio, gfp);
> >                       clear_huge_page(&folio->page, vmf->address, 1 << order);
> > +                     count_thp_event(order, THP_ALLOC_SUCCESS);
> >                       return folio;
> > +             } else {
>
> Do we need this else, given the last statement in the "if" clause is return?

I included this "else" because there is a scenario where we "goto next"
even though the allocation succeeds:

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }

After revisiting the code a second time, I noticed that the SUCCESS counter
is incremented after the "goto" statement, so I realize that I won't need
this "else" clause:

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }
                        folio_throttle_swaprate(folio, gfp);
                        clear_huge_page(&folio->page, vmf->address, 1 << order);
                        count_thp_event(order, THP_ALLOC_SUCCESS);
                        return folio;

So I have two options:
1. Move the count_thp_event(order, THP_ALLOC_SUCCESS) ahead of
   mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.

Since mem_cgroup_charge() rarely fails, option 2 should suffice.

By the way, why do we proceed to the next order after mem_cgroup_charge()
fails? I assume mem_cgroup_charge() is still likely to fail even after we
move to the next order?

> >
> > +                     count_thp_event(order, THP_ALLOC_FAIL);
> >               }
> >  next:
> >               order = next_order(&orders, order);
>
> Thanks,
> Ryan
>

Thanks
Barry