From: Barry Song <21cnbao@gmail.com>
To: Ryan Roberts
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, willy@infradead.org,
 david@redhat.com, cerasuolodomenico@gmail.com, kasong@tencent.com,
 surenb@google.com, v-songbaohua@oppo.com, yosryahmed@google.com,
 yuzhao@google.com, chrisl@kernel.org, peterx@redhat.com
Subject: Re: [PATCH v2] mm: add per-order mTHP alloc_success and alloc_fail counters
Date: Tue, 2 Apr 2024 22:40:38 +1300
References: <20240328095139.143374-1-21cnbao@gmail.com>

On Tue, Apr 2, 2024 at 9:58 PM Ryan Roberts wrote:
>
> On 28/03/2024 09:51, Barry Song wrote:
> > From: Barry Song
> >
> > Profiling a system blindly with mTHP has become challenging due
> > to the lack of visibility into its operations. Presenting the
> > success rate of mTHP allocations appears to be a pressing need.
> >
> > Recently, I've been experiencing significant difficulty debugging
> > performance improvements and regressions without these figures.
> > It's crucial for us to understand the true effectiveness of
> > mTHP in real-world scenarios, especially in systems with
> > fragmented memory.
> >
> > This patch sets up the framework for per-order mTHP counters,
> > starting with the introduction of alloc_success and alloc_fail
> > counters. Incorporating additional counters should now be
> > straightforward as well.
> >
> > The initial two unsigned longs for each event are unused, given
> > that order-0 and order-1 are not mTHP. Nonetheless, this refinement
> > improves code clarity.
>
> Overall, this approach looks good to me - thanks for doing this!

Thank you for reviewing!

> >
> > Signed-off-by: Barry Song
> > ---
> > -v2:
> >  * move to sysfs and provide per-order counters; David, Ryan, Willy
> > -v1:
> >  https://lore.kernel.org/linux-mm/20240326030103.50678-1-21cnbao@gmail.com/
> >
> >  include/linux/huge_mm.h | 17 +++++++++++++
> >  mm/huge_memory.c        | 54 +++++++++++++++++++++++++++++++++++++++++
> >  mm/memory.c             |  3 +++
> >  3 files changed, 74 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index e896ca4760f6..27fa26a22a8f 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -264,6 +264,23 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
> >                                       enforce_sysfs, orders);
> >  }
> >
> > +enum thp_event_item {
> > +     THP_ALLOC_SUCCESS,
> > +     THP_ALLOC_FAIL,
> > +     NR_THP_EVENT_ITEMS
> > +};
> > +
> > +struct thp_event_state {
> > +     unsigned long event[PMD_ORDER + 1][NR_THP_EVENT_ITEMS];
> > +};
> > +
> > +DECLARE_PER_CPU(struct thp_event_state, thp_event_states);
> > +
> > +static inline void count_thp_event(int order, enum thp_event_item item)
>
> We previously concluded that we will likely want a "currently allocated THPs"
> counter, which will need both increment and decrement. So perhaps

I intend to handle this via incremental patches, but it's unlikely to happen
this month due to my packed schedule with multiple tasks. Presently, my top
priority regarding counting is the fallback ratio for allocations. Following
that, it would likely be the SWAP allocation and fallback ratio, to determine
whether the swapfile is highly fragmented.

> "thp_event_inc()" (and eventual "thp_event_dec()") are more appropriate? And
> perhaps "event" isn't the right description either, since it implies always
> counting upwards (to me at least). Although I guess you have borrowed from the
> vmstat pattern? That file has some instantaneous counters, so how does it
> decrement?
>

Counting represents just one aspect of vmstat. In this scenario, the values
are consistently increasing. For metrics that can fluctuate in either
direction, an entirely different mechanism is employed: those are typically
managed through global atomic operations.

/*
 * Zone and node-based page accounting with per cpu differentials.
 */
extern atomic_long_t vm_zone_stat[NR_VM_ZONE_STAT_ITEMS];
extern atomic_long_t vm_node_stat[NR_VM_NODE_STAT_ITEMS];
extern atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];

void __mod_zone_page_state(struct zone *, enum zone_stat_item item, long);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
void __dec_zone_page_state(struct page *, enum zone_stat_item);

void __mod_node_page_state(struct pglist_data *, enum node_stat_item item, long);
void __inc_node_page_state(struct page *, enum node_stat_item);
void __dec_node_page_state(struct page *, enum node_stat_item);

static inline void zone_page_state_add(long x, struct zone *zone,
                                 enum zone_stat_item item)
{
        atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
}

Counting is likely most efficiently handled with per-CPU copies and a
summation function to aggregate values from all CPUs.
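To make the contrast concrete: if we later add a "currently allocated THPs"
counter, I'd expect it to follow the atomic pattern rather than the per-CPU
event pattern. A rough, untested sketch -- every name below is invented for
illustration and is not part of this patch:

#include <linux/atomic.h>

/*
 * Hypothetical bidirectional counter: one global atomic per order.
 * Unlike the per-CPU events, this one can also be decremented, e.g.
 * when a large folio is freed or split.
 */
static atomic_long_t thp_nr_allocated[PMD_ORDER + 1];

static inline void thp_nr_allocated_inc(int order)
{
        atomic_long_inc(&thp_nr_allocated[order]);
}

static inline void thp_nr_allocated_dec(int order)
{
        atomic_long_dec(&thp_nr_allocated[order]);
}

static inline long thp_nr_allocated_read(int order)
{
        return atomic_long_read(&thp_nr_allocated[order]);
}

The events in this patch never need to decrement, so the cheap per-CPU
increment plus a sum over all CPUs at read time is sufficient.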
> > +{
> > +     this_cpu_inc(thp_event_states.event[order][item]);
>
> Could we declare as "PMD_ORDER - 1" then do [order - 2] here? That would save
> 16 bytes per CPU. (8K in a system with 512 CPUs). Possibly with an
> order_index() helper, since the *_show() functions will need this too.

I actually put some words in the commit message:

"The initial two unsigned longs for each event are unused, given
that order-0 and order-1 are not mTHP. Nonetheless, this refinement
improves code clarity."

However, if conserving memory is a higher priority, I can use
PMD_ORDER - 1 in version 3.
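If I do switch in v3, I imagine roughly the following relative to the hunk
above -- only a sketch of Ryan's suggestion, with order_index() as a
hypothetical helper:

/* mTHP orders run from 2 to PMD_ORDER, so order 2 maps to slot 0. */
static inline int order_index(int order)
{
        return order - 2;
}

struct thp_event_state {
        unsigned long event[PMD_ORDER - 1][NR_THP_EVENT_ITEMS];
};

static inline void count_thp_event(int order, enum thp_event_item item)
{
        this_cpu_inc(thp_event_states.event[order_index(order)][item]);
}

sum_thp_events() and the *_show() callbacks would then go through the same
helper, so the order-to-slot mapping lives in one place.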
> > +}
> > +
> >  #define transparent_hugepage_use_zero_page()                         \
> >       (transparent_hugepage_flags &                                   \
> >        (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 1683de78c313..addd093d8410 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -526,6 +526,52 @@ static const struct kobj_type thpsize_ktype = {
> >       .sysfs_ops = &kobj_sysfs_ops,
> >  };
> >
> > +DEFINE_PER_CPU(struct thp_event_state, thp_event_states) = {{{0}}};
> > +
> > +static unsigned long sum_thp_events(int order, enum thp_event_item item)
> > +{
> > +     unsigned long sum = 0;
> > +     int cpu;
> > +
> > +     for_each_online_cpu(cpu) {
> > +             struct thp_event_state *this = &per_cpu(thp_event_states, cpu);
> > +
> > +             sum += this->event[order][item];
> > +     }
> > +
> > +     return sum;
> > +}
> > +
> > +static ssize_t alloc_success_show(struct kobject *kobj,
> > +                               struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_SUCCESS));
> > +}
> > +
> > +static ssize_t alloc_fail_show(struct kobject *kobj,
> > +                             struct kobj_attribute *attr, char *buf)
> > +{
> > +     int order = to_thpsize(kobj)->order;
> > +
> > +     return sysfs_emit(buf, "%lu\n", sum_thp_events(order, THP_ALLOC_FAIL));
> > +}
> > +
> > +static struct kobj_attribute alloc_success_attr = __ATTR_RO(alloc_success);
> > +static struct kobj_attribute alloc_fail_attr = __ATTR_RO(alloc_fail);
> > +
> > +static struct attribute *stats_attrs[] = {
> > +     &alloc_success_attr.attr,
> > +     &alloc_fail_attr.attr,
> > +     NULL,
> > +};
> > +
> > +static struct attribute_group stats_attr_group = {
> > +     .name = "stats",
> > +     .attrs = stats_attrs,
> > +};
> > +
> >  static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >  {
> >       unsigned long size = (PAGE_SIZE << order) / SZ_1K;
> > @@ -549,6 +595,12 @@ static struct thpsize *thpsize_create(int order, struct kobject *parent)
> >               return ERR_PTR(ret);
> >       }
> >
> > +     ret = sysfs_create_group(&thpsize->kobj, &stats_attr_group);
> > +     if (ret) {
> > +             kobject_put(&thpsize->kobj);
> > +             return ERR_PTR(ret);
> > +     }
> > +
> >       thpsize->order = order;
> >       return thpsize;
> >  }
> > @@ -1050,8 +1102,10 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
> >       folio = vma_alloc_folio(gfp, HPAGE_PMD_ORDER, vma, haddr, true);
> >       if (unlikely(!folio)) {
> >               count_vm_event(THP_FAULT_FALLBACK);
> > +             count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_FAIL);
> >               return VM_FAULT_FALLBACK;
> >       }
> > +     count_thp_event(HPAGE_PMD_ORDER, THP_ALLOC_SUCCESS);
> >       return __do_huge_pmd_anonymous_page(vmf, &folio->page, gfp);
> >  }
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 984b138f85b4..c9c1031c2ecb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4365,7 +4365,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >                       }
> >                       folio_throttle_swaprate(folio, gfp);
> >                       clear_huge_page(&folio->page, vmf->address, 1 << order);
> > +                     count_thp_event(order, THP_ALLOC_SUCCESS);
> >                       return folio;
> > +             } else {
>
> Do we need this else, given the last statement in the "if" clause is return?

I included this "else" because there is a scenario where we "goto next"
even though the allocation succeeds:

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }

After revisiting the code a second time, I noticed that the SUCCESS counter
is incremented after the "goto" statement, so I realize that I won't need
this "else" clause:

                folio = vma_alloc_folio(gfp, order, vma, addr, true);
                if (folio) {
                        if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
                                folio_put(folio);
                                goto next;
                        }
                        folio_throttle_swaprate(folio, gfp);
                        clear_huge_page(&folio->page, vmf->address, 1 << order);
                        count_thp_event(order, THP_ALLOC_SUCCESS);
                        return folio;

So I have two options:
1. Move the count_thp_event(order, THP_ALLOC_SUCCESS) ahead of
   mem_cgroup_charge(folio, vma->vm_mm, gfp) and retain the else clause.
2. Keep THP_ALLOC_SUCCESS unchanged and remove the else clause.

Since mem_cgroup_charge() rarely fails, option 2 should suffice.

By the way, why do we proceed to the next order after mem_cgroup_charge()
fails? I assume mem_cgroup_charge() is still likely to fail even after we
move to the next order?

> >
> > +                     count_thp_event(order, THP_ALLOC_FAIL);
> >               }
> >  next:
> >               order = next_order(&orders, order);
>
> Thanks,
> Ryan
>

Thanks
Barry