From: Yang Shi <shy828301@gmail.com>
Date: Wed, 20 Dec 2023 16:26:10 -0800
Subject: Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression
To: Yin Fengwei
Cc: kernel test robot, Rik van Riel, oe-lkp@lists.linux.dev, lkp@intel.com,
 Linux Memory Management List, Andrew Morton, Matthew Wilcox,
 Christopher Lameter, ying.huang@intel.com, feng.tang@intel.com
References: <202312192310.56367035-oliver.sang@intel.com>
On Wed, Dec 20, 2023 at 12:09 PM Yang Shi wrote:
>
> On Wed, Dec 20, 2023 at 12:34 AM Yin Fengwei wrote:
> >
> > On 2023/12/20 13:27, Yang Shi wrote:
> > > On Tue, Dec 19, 2023 at 7:41 AM kernel test robot wrote:
> > >>
> > >> Hello,
> > >>
> > >> for this commit, we reported
> > >> "[mm] 96db82a66d: will-it-scale.per_process_ops -95.3% regression"
> > >> in Aug, 2022 when it was in linux-next/master
> > >> https://lore.kernel.org/all/YwIoiIYo4qsYBcgd@xsang-OptiPlex-9020/
> > >>
> > >> later, we reported
> > >> "[mm] f35b5d7d67: will-it-scale.per_process_ops -95.5% regression"
> > >> in Oct, 2022 when it was in linus/master
> > >> https://lore.kernel.org/all/202210181535.7144dd15-yujie.liu@intel.com/
> > >>
> > >> and the commit was finally reverted by
> > >> commit 0ba09b1733878afe838fe35c310715fda3d46428
> > >> Author: Linus Torvalds
> > >> Date: Sun Dec 4 12:51:59 2022 -0800
> > >>
> > >> now we noticed it goes into linux-next/master again.
> > >>
> > >> we are not sure if there is an agreement that the benefit of this commit
> > >> already outweighs the performance drop in some micro benchmarks.
> > >>
> > >> we also noticed from https://lore.kernel.org/all/20231214223423.1133074-1-yang@os.amperecomputing.com/
> > >> that
> > >> "This patch was applied to v6.1, but was reverted due to a regression
> > >> report. However it turned out the regression was not due to this patch.
> > >> I ping'ed Andrew to reapply this patch, Andrew may forget it. This
> > >> patch helps promote THP, so I rebased it onto the latest mm-unstable."
> > >
> > > IIRC, Huang Ying's analysis showed the regression in the will-it-scale
> > > micro benchmark was fine; the commit was actually reverted due to a
> > > kernel build regression with LLVM reported by Nathan Chancellor. That
> > > regression was then resolved by commit
> > > 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 ("mm/thp: check and bail out
> > > if page in deferred queue already"). And this patch did improve kernel
> > > build with GCC by ~3% if I remember correctly.
> > >
> > >> however, unfortunately, in our latest tests, we still observed the below
> > >> regression with this commit. just FYI.
> > >>
> > >> kernel test robot noticed a -84.3% regression of stress-ng.pthread.ops_per_sec on:
> > >
> > > Interesting, wasn't the same regression seen last time? And I'm a
> > > little bit confused about how pthread got regressed. I didn't see the
> > > pthread benchmark do any intensive memory alloc/free operations. Do
> > > the pthread APIs do any intensive memory operations? I saw the
> > > benchmark does allocate memory for the thread stack, but that should be
> > > just 8K per thread, so it should not trigger what this patch does. With
> > > 1024 threads, the thread stacks may get merged into one single VMA (8M
> > > total), but that may happen even when the patch is not applied.
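For reference, the behaviour the patch subject describes can be observed with a
small standalone check like the one below. This is only an illustration, not the
patch itself; it assumes x86-64 with a 2 MiB PMD size, and whether the region is
actually huge-page backed also depends on
/sys/kernel/mm/transparent_hugepage/enabled.

/*
 * Illustration only: map an 8 MiB anonymous region (the glibc default
 * pthread stack size), check whether the returned address is PMD
 * (2 MiB) aligned, then fault it in and print the non-zero
 * AnonHugePages counters from /proc/self/smaps.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ       (8UL << 20)
#define PMD_SIZE (2UL << 20)

int main(void)
{
	char line[256];
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("addr %p, PMD aligned: %s\n", (void *)p,
	       ((unsigned long)p & (PMD_SIZE - 1)) == 0 ? "yes" : "no");

	memset(p, 0, SZ);	/* fault the whole region in */

	/* non-zero AnonHugePages entries mean THP actually backs a VMA */
	FILE *f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "AnonHugePages") && !strstr(line, " 0 kB"))
			fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}

With the alignment patch applied, an 8 MiB anonymous mapping like this typically
comes back 2 MiB aligned and picks up huge pages once touched; without it the
start address is usually not PMD aligned.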
> > stress-ng.pthread test code is strange here:
> >
> > https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
> >
> > Even though it allocates its own stack, that attr is not passed to
> > pthread_create. So it's still glibc that allocates the stack for the
> > pthread, and that stack is 8M in size. This is why this patch can impact
> > the stress-ng.pthread testing.
>
> Aha, nice catch, I overlooked that.
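To make that concrete, the pattern boils down to something like the sketch below
(a reduced illustration, not the actual stress-ng code): the attr carrying the
caller-provided stack is set up, but pthread_create() never sees it.

/*
 * Reduced illustration: an attr with a small caller-provided stack is
 * prepared but never handed to pthread_create(), so glibc still mmap()s
 * its own default ~8 MiB stack for the thread.
 */
#include <pthread.h>
#include <stdlib.h>

static void *worker(void *arg)
{
	return NULL;
}

int main(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	void *stack = NULL;

	posix_memalign(&stack, 4096, 16384);	/* small 16 KiB stack */
	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stack, 16384);

	/*
	 * Passing NULL instead of &attr means the small stack above is
	 * never used; glibc allocates the default-sized stack via mmap().
	 */
	pthread_create(&tid, NULL, worker, NULL);
	pthread_join(tid, NULL);

	pthread_attr_destroy(&attr);
	free(stack);
	return 0;
}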
> >
> > My understanding is this is a different regression (if it's a valid
> > regression). The previous hotspot was in:
> >   deferred_split_huge_page
> >   deferred_split_huge_page
> >   deferred_split_huge_page
> >   spin_lock
> >
> > while this time, the hotspot is in (pmd_lock from do_madvise I suppose):
> >   - 55.02% zap_pmd_range.isra.0
> >      - 53.42% __split_huge_pmd
> >         - 51.74% _raw_spin_lock
> >            - 51.73% native_queued_spin_lock_slowpath
> >               + 3.03% asm_sysvec_call_function
> >         - 1.67% __split_huge_pmd_locked
> >            - 0.87% pmdp_invalidate
> >               + 0.86% flush_tlb_mm_range
> >      - 1.60% zap_pte_range
> >         - 1.04% page_remove_rmap
> >              0.55% __mod_lruvec_page_state
> >
> > >>
> > >> commit: 1111d46b5cbad57486e7a3fab75888accac2f072 ("mm: align larger anonymous mappings on THP boundaries")
> > >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >>
> > >> testcase: stress-ng
> > >> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 128G memory
> > >> parameters:
> > >>
> > >>         nr_threads: 1
> > >>         disk: 1HDD
> > >>         testtime: 60s
> > >>         fs: ext4
> > >>         class: os
> > >>         test: pthread
> > >>         cpufreq_governor: performance
> > >>
> > >> In addition to that, the commit also has significant impact on the following tests:
> > >>
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >> | testcase: change | stream: stream.triad_bandwidth_MBps -12.1% regression                                         |
> > >> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory   |
> > >> | test parameters  | array_size=50000000                                                                            |
> > >> |                  | cpufreq_governor=performance                                                                   |
> > >> |                  | iterations=10x                                                                                 |
> > >> |                  | loop=100                                                                                       |
> > >> |                  | nr_threads=25%                                                                                 |
> > >> |                  | omp=true                                                                                       |
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.Integer.mb_s -3.5% regression       |
> > >> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
> > >> | test parameters  | cpufreq_governor=performance                                                                   |
> > >> |                  | option_a=Average                                                                               |
> > >> |                  | option_b=Integer                                                                               |
> > >> |                  | test=ramspeed-1.4.3                                                                            |
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.FloatingPoint.mb_s -3.0% regression |
> > >> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
> > >> | test parameters  | cpufreq_governor=performance                                                                   |
> > >> |                  | option_a=Average                                                                               |
> > >> |                  | option_b=Floating Point                                                                        |
> > >> |                  | test=ramspeed-1.4.3                                                                            |
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >>
> > >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > >> the same patch/commit), kindly add the following tags
> > >> | Reported-by: kernel test robot
> > >> | Closes: https://lore.kernel.org/oe-lkp/202312192310.56367035-oliver.sang@intel.com
> > >>
> > >> Details are as below:
> > >> -------------------------------------------------------------------------------------------------->
> > >>
> > >> The kernel config and materials to reproduce are available at:
> > >> https://download.01.org/0day-ci/archive/20231219/202312192310.56367035-oliver.sang@intel.com
> > >>
> > >> =========================================================================================
> > >> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> > >>   os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-d02/pthread/stress-ng/60s
> > >>
> > >> commit:
> > >>   30749e6fbb ("mm/memory: replace kmap() with kmap_local_page()")
> > >>   1111d46b5c ("mm: align larger anonymous mappings on THP boundaries")
> > >>
> > >> 30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75
> > >> ---------------- ---------------------------
> > >>          %stddev     %change         %stddev
> > >>              \          |                \
> > >>   13405796           -65.5%    4620124        cpuidle..usage
> > >>       8.00            +8.2%       8.66 ±  2%  iostat.cpu.system
> > >>       1.61           -60.6%       0.63        iostat.cpu.user
> > >>     597.50 ± 14%     -64.3%     213.50 ± 14%  perf-c2c.DRAM.local
> > >>       1882 ± 14%     -74.7%     476.83 ±  7%  perf-c2c.HITM.local
> > >>    3768436           -12.9%    3283395        vmstat.memory.cache
> > >>     355105           -75.7%      86344 ±  3%  vmstat.system.cs
> > >>     385435           -20.7%     305714 ±  3%  vmstat.system.in
> > >>       1.13            -0.2        0.88        mpstat.cpu.all.irq%
> > >>       0.29            -0.2        0.10 ±  2%  mpstat.cpu.all.soft%
> > >>       6.76 ±  2%      +1.1        7.88 ±  2%  mpstat.cpu.all.sys%
> > >>       1.62            -1.0        0.62 ±  2%  mpstat.cpu.all.usr%
> > >>    2234397           -84.3%     350161 ±  5%  stress-ng.pthread.ops
> > >>      37237           -84.3%       5834 ±  5%  stress-ng.pthread.ops_per_sec
> > >>     294706 ±  2%     -68.0%      94191 ±  6%  stress-ng.time.involuntary_context_switches
> > >>      41442 ±  2%   +5023.4%    2123284        stress-ng.time.maximum_resident_set_size
> > >>    4466457           -83.9%     717053 ±  5%  stress-ng.time.minor_page_faults
> > >
> > > The larger RSS and fewer page faults are expected.
> > >
> > >>     243.33           +13.5%     276.17 ±  3%  stress-ng.time.percent_of_cpu_this_job_got
> > >>     131.64           +27.7%     168.11 ±  3%  stress-ng.time.system_time
> > >>      19.73           -82.1%       3.53 ±  4%  stress-ng.time.user_time
> > >
> > > Much less user time. And it seems to match the drop of the pthread metric.
> > >
> > >>    7715609           -80.2%    1530125 ±  4%  stress-ng.time.voluntary_context_switches
> > >>      76728           -80.8%      14724 ±  4%  perf-stat.i.minor-faults
> > >>    5600408           -61.4%    2160997 ±  5%  perf-stat.i.node-loads
> > >>    8873996           +52.1%   13499744 ±  5%  perf-stat.i.node-stores
> > >>     112409           -81.9%      20305 ±  4%  perf-stat.i.page-faults
> > >>       2.55           +89.6%       4.83        perf-stat.overall.MPKI
> > >
> > > Much more TLB misses.
> > >
> > >>       1.51            -0.4        1.13        perf-stat.overall.branch-miss-rate%
> > >>      19.26           +24.5       43.71        perf-stat.overall.cache-miss-rate%
> > >>       1.70           +56.4%       2.65        perf-stat.overall.cpi
> > >>     665.84           -17.5%     549.51 ±  2%  perf-stat.overall.cycles-between-cache-misses
> > >>       0.12 ±  4%      -0.1        0.04        perf-stat.overall.dTLB-load-miss-rate%
> > >>       0.08 ±  2%      -0.0        0.03        perf-stat.overall.dTLB-store-miss-rate%
> > >>      59.16            +0.9       60.04        perf-stat.overall.iTLB-load-miss-rate%
> > >>       1278           +86.1%       2379 ±  2%  perf-stat.overall.instructions-per-iTLB-miss
> > >>       0.59           -36.1%       0.38        perf-stat.overall.ipc
> > >
> > > Worse IPC and CPI.
> > >
> > >>  2.078e+09           -48.3%  1.074e+09 ±  4%  perf-stat.ps.branch-instructions
> > >>   31292687           -61.2%   12133349 ±  2%  perf-stat.ps.branch-misses
> > >>   26057291            -5.9%   24512034 ±  4%  perf-stat.ps.cache-misses
> > >>  1.353e+08           -58.6%   56072195 ±  4%  perf-stat.ps.cache-references
> > >>     365254           -75.8%      88464 ±  3%  perf-stat.ps.context-switches
> > >>  1.735e+10           -22.4%  1.346e+10 ±  2%  perf-stat.ps.cpu-cycles
> > >>      60838           -79.1%      12727 ±  6%  perf-stat.ps.cpu-migrations
> > >>    3056601 ±  4%     -81.5%     565354 ±  4%  perf-stat.ps.dTLB-load-misses
> > >>  2.636e+09           -50.7%    1.3e+09 ±  4%  perf-stat.ps.dTLB-loads
> > >>    1155253 ±  2%     -83.0%     196581 ±  5%  perf-stat.ps.dTLB-store-misses
> > >>  1.473e+09           -57.4%  6.268e+08 ±  3%  perf-stat.ps.dTLB-stores
> > >>    7997726           -73.3%    2131477 ±  3%  perf-stat.ps.iTLB-load-misses
> > >>    5521346           -74.3%    1418623 ±  2%  perf-stat.ps.iTLB-loads
> > >>  1.023e+10           -50.4%  5.073e+09 ±  4%  perf-stat.ps.instructions
> > >>      75671           -80.9%      14479 ±  4%  perf-stat.ps.minor-faults
> > >>    5549722           -61.4%    2141750 ±  4%  perf-stat.ps.node-loads
> > >>    8769156           +51.6%   13296579 ±  5%  perf-stat.ps.node-stores
> > >>     110795           -82.0%      19977 ±  4%  perf-stat.ps.page-faults
> > >>  6.482e+11           -50.7%  3.197e+11 ±  4%  perf-stat.total.instructions
> > >>       0.00 ± 37%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc_node.memcg_alloc_slab_cgroups.allocate_slab
> > >>       0.01 ± 18%   +8373.1%       0.73 ± 49%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.do_madvise.__x64_sys_madvise.do_syscall_64
> > >>       0.01 ± 16%   +4600.0%       0.38 ± 24%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.exit_mm.do_exit.__x64_sys_exit
> > >
> > > More time is spent in madvise and munmap, but I'm not sure whether this
> > > is caused by tearing down the address space when exiting the test. If
> > > so, it should not count toward the regression.
> >
> > It's not the whole address space being torn down. It's the pthread
> > stack being torn down when the pthread exits (can that be treated as
> > address space teardown? I suppose so).
> >
> > https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
> > https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576
>
> It explains the problem. The madvise() does have some extra overhead
> for handling THP (splitting pmd, deferred split queue, etc).
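As a standalone sketch of where that overhead comes from (assuming the glibc
stack-cache path amounts to an madvise(MADV_DONTNEED) over most of the stack at
thread exit), something like this exercises the same kernel path once the stack
VMA is THP-backed:

/*
 * Sketch only: once an 8 MiB stack VMA is THP-aligned and PMD-mapped,
 * a DONTNEED whose bounds cut through huge PMDs has to go through
 * __split_huge_pmd() under the PMD lock, which matches the new
 * hotspot above.
 */
#include <string.h>
#include <sys/mman.h>

#define STACK_SZ (8UL << 20)

int main(void)
{
	char *stk = mmap(NULL, STACK_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (stk == MAP_FAILED)
		return 1;

	memset(stk, 0, STACK_SZ);	/* fault in; THP-eligible when aligned */

	/*
	 * Drop most of the stack but keep the VMA around for reuse, the
	 * way a cached thread stack is handled; the page-granular bounds
	 * partially cover huge PMDs and force the split path.
	 */
	madvise(stk + 4096, STACK_SZ - 8192, MADV_DONTNEED);
	return 0;
}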
> > Another thing is whether it's worth it to make the stack use THP? It may
> > be useful for some apps which need a large stack size?
>
> Kernel actually doesn't apply THP to stacks (see
> vma_is_temporary_stack()). But the kernel can only tell whether a VMA is
> a stack by checking the VM_GROWSDOWN | VM_GROWSUP flags. So if glibc
> doesn't set the proper flags to tell the kernel the area is a stack, the
> kernel just treats it as a normal anonymous area. So glibc should set up
> the stack properly IMHO.

If I read the code correctly, nptl allocates the stack with the code below:

        mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

See https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L563

MAP_STACK is used, but it is a no-op on Linux. So the alternative is to
make MAP_STACK useful on Linux instead of changing glibc. But the blast
radius seems much wider.
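As a sketch of the "teach the allocator" direction, and purely hypothetical (this
is not what glibc does today, and allocate_thread_stack() below is a made-up
helper, not the real nptl function): under the current ABI the allocator could opt
the region out of THP itself with MADV_NOHUGEPAGE right after the mmap.

/*
 * Hypothetical variant of the nptl allocation above, not current glibc
 * behaviour: since MAP_STACK is a no-op and the kernel cannot infer
 * "this is a stack" from the mmap flags, the allocator itself could
 * opt the region out of THP with MADV_NOHUGEPAGE right after the mmap.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *allocate_thread_stack(size_t size, size_t guardsize)
{
	int prot = PROT_READ | PROT_WRITE;
	void *mem = mmap(NULL, size, guardsize == 0 ? prot : PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

	if (mem == MAP_FAILED)
		return NULL;

	/* tell khugepaged and the fault path to skip this VMA */
	madvise(mem, size, MADV_NOHUGEPAGE);
	return mem;
}

That keeps the change local to the stack mapping, at the cost of never getting THP
there even for workloads that might actually want it.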
> >
> > Regards
> > Yin, Fengwei