From: Yang Shi
Date: Wed, 20 Dec 2023 12:09:07 -0800
Subject: Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression
To: Yin Fengwei
Cc: kernel test robot, Rik van Riel, oe-lkp@lists.linux.dev, lkp@intel.com,
 Linux Memory Management List, Andrew Morton, Matthew Wilcox,
 Christopher Lameter, ying.huang@intel.com, feng.tang@intel.com
References: <202312192310.56367035-oliver.sang@intel.com>

On Wed, Dec 20, 2023 at 12:34 AM Yin Fengwei wrote:
>
> On 2023/12/20 13:27, Yang Shi wrote:
> > On Tue, Dec 19, 2023 at 7:41 AM kernel test robot wrote:
> >>
> >> Hello,
> >>
> >> for this commit, we reported
> >> "[mm] 96db82a66d: will-it-scale.per_process_ops -95.3% regression"
> >> in Aug, 2022 when it was in linux-next/master
> >> https://lore.kernel.org/all/YwIoiIYo4qsYBcgd@xsang-OptiPlex-9020/
> >>
> >> later, we reported
> >> "[mm] f35b5d7d67: will-it-scale.per_process_ops -95.5% regression"
> >> in Oct, 2022 when it was in linus/master
> >> https://lore.kernel.org/all/202210181535.7144dd15-yujie.liu@intel.com/
> >>
> >> and the commit was finally reverted by
> >> commit 0ba09b1733878afe838fe35c310715fda3d46428
> >> Author: Linus Torvalds
> >> Date:   Sun Dec 4 12:51:59 2022 -0800
> >>
> >> now we noticed it has gone into linux-next/master again.
> >>
> >> we are not sure if there is an agreement that the benefit of this commit
> >> now outweighs the performance drop in some micro benchmarks.
> >>
> >> we also noticed from https://lore.kernel.org/all/20231214223423.1133074-1-yang@os.amperecomputing.com/
> >> that
> >> "This patch was applied to v6.1, but was reverted due to a regression
> >> report. However it turned out the regression was not due to this patch.
> >> I ping'ed Andrew to reapply this patch, Andrew may forget it. This
> >> patch helps promote THP, so I rebased it onto the latest mm-unstable."
> >
> > IIRC, Huang Ying's analysis showed the regression in the will-it-scale
> > micro benchmark was fine; the patch was actually reverted due to a
> > kernel build regression with LLVM reported by Nathan Chancellor. That
> > regression was then resolved by commit
> > 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 ("mm/thp: check and bail out
> > if page in deferred queue already"). And this patch did improve kernel
> > build with GCC by ~3% if I remember correctly.
> >
> >> however, unfortunately, in our latest tests, we still observed the
> >> regression below with this commit. just FYI.
> >>
> >> kernel test robot noticed a -84.3% regression of stress-ng.pthread.ops_per_sec on:
> >
> > Interesting, wasn't the same regression seen last time? And I'm a
> > little bit confused about how pthread got regressed. I didn't see the
> > pthread benchmark do any intensive memory alloc/free operations. Do
> > the pthread APIs do any intensive memory operations? I saw the
> > benchmark does allocate memory for the thread stacks, but it should be
> > just 8K per thread, so it should not trigger what this patch does. With
> > 1024 threads, the thread stacks may get merged into one single VMA (8M
> > total), but that may happen even when the patch is not applied.
> stress-ng.pthread test code is strange here:
>
> https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
>
> Even though it allocates its own stack, that attr is never passed to
> pthread_create. So it is still glibc that allocates the stack for each
> pthread, and that stack is 8M in size. This is why this patch can impact
> the stress-ng.pthread test.

Aha, nice catch, I overlooked that.
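For the record, here is a condensed sketch of that pattern (my
simplification, not the literal stress-ng code):

#include <pthread.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

static void *thread_fn(void *arg)
{
	return NULL;
}

int main(void)
{
	pthread_attr_t attr;
	void *stack = malloc(STACK_SIZE);	/* a custom stack is allocated... */
	pthread_t t;

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stack, STACK_SIZE);

	/* ...but NULL is passed instead of &attr, so glibc ignores the
	 * custom stack and mmap()s its default ~8M stack instead. */
	pthread_create(&t, NULL, thread_fn, NULL);
	pthread_join(t, NULL);
	free(stack);
	return 0;
}

So every thread ends up on a glibc-allocated ~8M stack, which is exactly
the size class the alignment patch targets.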
> My understanding is this is a different regression (if it's a valid
> regression). The previous hotspot was in:
>
>     deferred_split_huge_page
>     deferred_split_huge_page
>     deferred_split_huge_page
>     spin_lock
>
> while this time the hotspot is in (pmd_lock from do_madvise, I suppose):
>
>   - 55.02% zap_pmd_range.isra.0
>      - 53.42% __split_huge_pmd
>         - 51.74% _raw_spin_lock
>            - 51.73% native_queued_spin_lock_slowpath
>               + 3.03% asm_sysvec_call_function
>         - 1.67% __split_huge_pmd_locked
>            - 0.87% pmdp_invalidate
>               + 0.86% flush_tlb_mm_range
>      - 1.60% zap_pte_range
>         - 1.04% page_remove_rmap
>              0.55% __mod_lruvec_page_state
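For anyone who wants to poke at this path without the full benchmark,
here is a minimal sketch (my illustration, not the lkp reproducer; it
assumes a kernel with the alignment patch applied and THP available):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define MAP_SZ	(8UL << 20)	/* glibc's default thread stack size */
#define HPAGE	(2UL << 20)

int main(void)
{
	/* With the alignment patch, a large anonymous mapping like this
	 * comes back 2M-aligned, so faults can be served with PMD-sized
	 * THPs. */
	char *p = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	printf("2M-aligned: %s\n",
	       ((unsigned long)p & (HPAGE - 1)) ? "no" : "yes");

	madvise(p, MAP_SZ, MADV_HUGEPAGE);	/* opt in, in case THP=madvise */
	memset(p, 0, MAP_SZ);			/* fault it in, possibly as THP */

	/* Discarding a range that only partially covers a PMD forces
	 * __split_huge_pmd() under the PMD lock. glibc issues a similar
	 * madvise(MADV_DONTNEED) on the stack when a thread exits. */
	madvise(p + 4096, HPAGE, MADV_DONTNEED);
	munmap(p, MAP_SZ);
	return 0;
}

The partial-range MADV_DONTNEED has to split the huge pmd under the PMD
lock, which is where the cycles go in the profile above.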
> >> commit: 1111d46b5cbad57486e7a3fab75888accac2f072 ("mm: align larger anonymous mappings on THP boundaries")
> >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >>
> >> testcase: stress-ng
> >> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 128G memory
> >> parameters:
> >>
> >>	nr_threads: 1
> >>	disk: 1HDD
> >>	testtime: 60s
> >>	fs: ext4
> >>	class: os
> >>	test: pthread
> >>	cpufreq_governor: performance
> >>
> >> In addition to that, the commit also has significant impact on the following tests:
> >>
> >> +------------------+-----------------------------------------------------------------------------------------------+
> >> | testcase: change | stream: stream.triad_bandwidth_MBps -12.1% regression                                         |
> >> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory  |
> >> | test parameters  | array_size=50000000                                                                           |
> >> |                  | cpufreq_governor=performance                                                                  |
> >> |                  | iterations=10x                                                                                |
> >> |                  | loop=100                                                                                      |
> >> |                  | nr_threads=25%                                                                                |
> >> |                  | omp=true                                                                                      |
> >> +------------------+-----------------------------------------------------------------------------------------------+
> >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.Integer.mb_s -3.5% regression      |
> >> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory   |
> >> | test parameters  | cpufreq_governor=performance                                                                  |
> >> |                  | option_a=Average                                                                              |
> >> |                  | option_b=Integer                                                                              |
> >> |                  | test=ramspeed-1.4.3                                                                           |
> >> +------------------+-----------------------------------------------------------------------------------------------+
> >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.FloatingPoint.mb_s -3.0% regression |
> >> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory   |
> >> | test parameters  | cpufreq_governor=performance                                                                  |
> >> |                  | option_a=Average                                                                              |
> >> |                  | option_b=Floating Point                                                                       |
> >> |                  | test=ramspeed-1.4.3                                                                           |
> >> +------------------+-----------------------------------------------------------------------------------------------+
> >>
> >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> >> the same patch/commit), kindly add following tags
> >> | Reported-by: kernel test robot
> >> | Closes: https://lore.kernel.org/oe-lkp/202312192310.56367035-oliver.sang@intel.com
> >>
> >> Details are as below:
> >> -------------------------------------------------------------------------------------------------->
> >>
> >> The kernel config and materials to reproduce are available at:
> >> https://download.01.org/0day-ci/archive/20231219/202312192310.56367035-oliver.sang@intel.com
> >>
> >> =========================================================================================
> >> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> >>   os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-d02/pthread/stress-ng/60s
> >>
> >> commit:
> >>   30749e6fbb ("mm/memory: replace kmap() with kmap_local_page()")
> >>   1111d46b5c ("mm: align larger anonymous mappings on THP boundaries")
> >>
> >> 30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75
> >> ---------------- ---------------------------
> >>          %stddev     %change         %stddev
> >>              \          |                \
> >>   13405796           -65.5%    4620124        cpuidle..usage
> >>       8.00            +8.2%       8.66 ±  2%  iostat.cpu.system
> >>       1.61           -60.6%       0.63        iostat.cpu.user
> >>     597.50 ± 14%     -64.3%     213.50 ± 14%  perf-c2c.DRAM.local
> >>       1882 ± 14%     -74.7%     476.83 ±  7%  perf-c2c.HITM.local
> >>    3768436           -12.9%    3283395        vmstat.memory.cache
> >>     355105           -75.7%      86344 ±  3%  vmstat.system.cs
> >>     385435           -20.7%     305714 ±  3%  vmstat.system.in
> >>       1.13            -0.2        0.88        mpstat.cpu.all.irq%
> >>       0.29            -0.2        0.10 ±  2%  mpstat.cpu.all.soft%
> >>       6.76 ±  2%      +1.1        7.88 ±  2%  mpstat.cpu.all.sys%
> >>       1.62            -1.0        0.62 ±  2%  mpstat.cpu.all.usr%
> >>    2234397           -84.3%     350161 ±  5%  stress-ng.pthread.ops
> >>      37237           -84.3%       5834 ±  5%  stress-ng.pthread.ops_per_sec
> >>     294706 ±  2%     -68.0%      94191 ±  6%  stress-ng.time.involuntary_context_switches
> >>      41442 ±  2%   +5023.4%    2123284        stress-ng.time.maximum_resident_set_size
> >>    4466457           -83.9%     717053 ±  5%  stress-ng.time.minor_page_faults
> >
> > The larger RSS and fewer page faults are expected.
> >
> >>     243.33           +13.5%     276.17 ±  3%  stress-ng.time.percent_of_cpu_this_job_got
> >>     131.64           +27.7%     168.11 ±  3%  stress-ng.time.system_time
> >>      19.73           -82.1%       3.53 ±  4%  stress-ng.time.user_time
> >
> > Much less user time, and it seems to match the drop in the pthread metric.
> >
> >>    7715609           -80.2%    1530125 ±  4%  stress-ng.time.voluntary_context_switches
> >>      76728           -80.8%      14724 ±  4%  perf-stat.i.minor-faults
> >>    5600408           -61.4%    2160997 ±  5%  perf-stat.i.node-loads
> >>    8873996           +52.1%   13499744 ±  5%  perf-stat.i.node-stores
> >>     112409           -81.9%      20305 ±  4%  perf-stat.i.page-faults
> >>       2.55           +89.6%       4.83        perf-stat.overall.MPKI
> >
> > Much more TLB misses.
> >
> >>       1.51            -0.4        1.13        perf-stat.overall.branch-miss-rate%
> >>      19.26           +24.5       43.71        perf-stat.overall.cache-miss-rate%
> >>       1.70           +56.4%       2.65        perf-stat.overall.cpi
> >>     665.84           -17.5%     549.51 ±  2%  perf-stat.overall.cycles-between-cache-misses
> >>       0.12 ±  4%      -0.1        0.04        perf-stat.overall.dTLB-load-miss-rate%
> >>       0.08 ±  2%      -0.0        0.03        perf-stat.overall.dTLB-store-miss-rate%
> >>      59.16            +0.9       60.04        perf-stat.overall.iTLB-load-miss-rate%
> >>       1278           +86.1%       2379 ±  2%  perf-stat.overall.instructions-per-iTLB-miss
> >>       0.59           -36.1%       0.38        perf-stat.overall.ipc
> >
> > Worse IPC and CPI.
> >
> >>  2.078e+09           -48.3%  1.074e+09 ±  4%  perf-stat.ps.branch-instructions
> >>   31292687           -61.2%   12133349 ±  2%  perf-stat.ps.branch-misses
> >>   26057291            -5.9%   24512034 ±  4%  perf-stat.ps.cache-misses
> >>  1.353e+08           -58.6%   56072195 ±  4%  perf-stat.ps.cache-references
> >>     365254           -75.8%      88464 ±  3%  perf-stat.ps.context-switches
> >>  1.735e+10           -22.4%  1.346e+10 ±  2%  perf-stat.ps.cpu-cycles
> >>      60838           -79.1%      12727 ±  6%  perf-stat.ps.cpu-migrations
> >>    3056601 ±  4%     -81.5%     565354 ±  4%  perf-stat.ps.dTLB-load-misses
> >>  2.636e+09           -50.7%    1.3e+09 ±  4%  perf-stat.ps.dTLB-loads
> >>    1155253 ±  2%     -83.0%     196581 ±  5%  perf-stat.ps.dTLB-store-misses
> >>  1.473e+09           -57.4%  6.268e+08 ±  3%  perf-stat.ps.dTLB-stores
> >>    7997726           -73.3%    2131477 ±  3%  perf-stat.ps.iTLB-load-misses
> >>    5521346           -74.3%    1418623 ±  2%  perf-stat.ps.iTLB-loads
> >>  1.023e+10           -50.4%  5.073e+09 ±  4%  perf-stat.ps.instructions
> >>      75671           -80.9%      14479 ±  4%  perf-stat.ps.minor-faults
> >>    5549722           -61.4%    2141750 ±  4%  perf-stat.ps.node-loads
> >>    8769156           +51.6%   13296579 ±  5%  perf-stat.ps.node-stores
> >>     110795           -82.0%      19977 ±  4%  perf-stat.ps.page-faults
> >>  6.482e+11           -50.7%  3.197e+11 ±  4%  perf-stat.total.instructions
> >>       0.00 ± 37%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc_node.memcg_alloc_slab_cgroups.allocate_slab
> >>       0.01 ± 18%   +8373.1%       0.73 ± 49%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.do_madvise.__x64_sys_madvise.do_syscall_64
> >>       0.01 ± 16%   +4600.0%       0.38 ± 24%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.exit_mm.do_exit.__x64_sys_exit
> >
> > More time is spent in madvise and munmap, but I'm not sure whether this
> > is caused by tearing down the address space when exiting the test. If
> > so, it should not count toward the regression.
> It's not the whole address space being torn down. It's the pthread stack
> being torn down when a pthread exits (can that be treated as address
> space teardown? I suppose so):
>
> https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
> https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576

That explains the problem. The madvise() does have some extra overhead
for handling THP (splitting the pmd, the deferred split queue, etc.).
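As an aside -- purely a sketch of a possible userspace workaround, not
something glibc does today -- an application that manages its own stacks
could opt them out of THP with MADV_NOHUGEPAGE (needs
CONFIG_TRANSPARENT_HUGEPAGE) and, unlike stress-ng, actually pass the
attr. Error handling and cleanup are omitted for brevity:

#include <pthread.h>
#include <sys/mman.h>

#define STACK_SIZE (8UL << 20)

static void *worker(void *arg)
{
	return NULL;
}

/* Hypothetical helper: allocate a stack, keep THP away from it, and
 * hand it to pthread_create via the attr. */
static int spawn_with_nohuge_stack(pthread_t *t)
{
	pthread_attr_t attr;
	void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
	if (stack == MAP_FAILED)
		return -1;

	/* Keep this range on 4K pages, so thread exit never has to
	 * split a huge pmd. */
	madvise(stack, STACK_SIZE, MADV_NOHUGEPAGE);

	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stack, STACK_SIZE);
	return pthread_create(t, &attr, worker, NULL);
}

int main(void)
{
	pthread_t t;

	if (spawn_with_nohuge_stack(&t))
		return 1;
	pthread_join(t, NULL);
	return 0;
}

That said, a real fix belongs in glibc or the kernel rather than in
every application.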
> Another question is whether it's worthwhile to make stacks use THP at
> all. It may be useful for some apps which need a large stack size?

The kernel actually doesn't apply THP to stacks (see
vma_is_temporary_stack()), but it can only tell whether a VMA is a stack
by checking the VM_GROWSDOWN | VM_GROWSUP flags. So if glibc doesn't set
the proper flags to tell the kernel the area is a stack, the kernel just
treats it as a normal anonymous area. So glibc should set up the stack
properly, IMHO.

>
> Regards
> Yin, Fengwei