From: Yang Shi <shy828301@gmail.com>
Date: Wed, 20 Dec 2023 16:26:10 -0800
Subject: Re: [linux-next:master] [mm] 1111d46b5c: stress-ng.pthread.ops_per_sec -84.3% regression
To: Yin Fengwei
Cc: kernel test robot, Rik van Riel, oe-lkp@lists.linux.dev, lkp@intel.com,
 Linux Memory Management List, Andrew Morton, Matthew Wilcox,
 Christopher Lameter, ying.huang@intel.com, feng.tang@intel.com
References: <202312192310.56367035-oliver.sang@intel.com>
On Wed, Dec 20, 2023 at 12:09 PM Yang Shi wrote:
>
> On Wed, Dec 20, 2023 at 12:34 AM Yin Fengwei wrote:
> >
> > On 2023/12/20 13:27, Yang Shi wrote:
> > > On Tue, Dec 19, 2023 at 7:41 AM kernel test robot wrote:
> > >>
> > >> Hello,
> > >>
> > >> for this commit, we reported
> > >> "[mm] 96db82a66d: will-it-scale.per_process_ops -95.3% regression"
> > >> in Aug, 2022 when it was in linux-next/master
> > >> https://lore.kernel.org/all/YwIoiIYo4qsYBcgd@xsang-OptiPlex-9020/
> > >>
> > >> later, we reported
> > >> "[mm] f35b5d7d67: will-it-scale.per_process_ops -95.5% regression"
> > >> in Oct, 2022 when it was in linus/master
> > >> https://lore.kernel.org/all/202210181535.7144dd15-yujie.liu@intel.com/
> > >>
> > >> and the commit was finally reverted by
> > >> commit 0ba09b1733878afe838fe35c310715fda3d46428
> > >> Author: Linus Torvalds
> > >> Date: Sun Dec 4 12:51:59 2022 -0800
> > >>
> > >> now we noticed it goes into linux-next/master again.
> > >>
> > >> we are not sure if there is an agreement that the benefit of this commit
> > >> already outweighs the performance drop in some micro benchmarks.
> > >>
> > >> we also noticed from https://lore.kernel.org/all/20231214223423.1133074-1-yang@os.amperecomputing.com/
> > >> that
> > >> "This patch was applied to v6.1, but was reverted due to a regression
> > >> report. However it turned out the regression was not due to this patch.
> > >> I ping'ed Andrew to reapply this patch, Andrew may forget it. This
> > >> patch helps promote THP, so I rebased it onto the latest mm-unstable."
> > >
> > > IIRC, Huang Ying's analysis showed the regression in the will-it-scale
> > > micro benchmark was fine; the commit was actually reverted due to a
> > > kernel build regression with LLVM reported by Nathan Chancellor. That
> > > regression was then resolved by commit
> > > 81e506bec9be1eceaf5a2c654e28ba5176ef48d8 ("mm/thp: check and bail out
> > > if page in deferred queue already"). And this patch did improve kernel
> > > build with GCC by ~3% if I remember correctly.
> > >
> > >> however, unfortunately, in our latest tests, we still observed the below
> > >> regression with this commit. just FYI.
> > >>
> > >> kernel test robot noticed a -84.3% regression of stress-ng.pthread.ops_per_sec on:
> > >
> > > Interesting, wasn't the same regression seen last time? And I'm a
> > > little bit confused about how pthread got regressed. I didn't see the
> > > pthread benchmark do any intensive memory alloc/free operations. Do
> > > the pthread APIs do any intensive memory operations? I saw the
> > > benchmark does allocate memory for the thread stack, but that should be
> > > just 8K per thread, so it should not trigger what this patch does. With
> > > 1024 threads, the thread stacks may get merged into one single VMA (8M
> > > total), but that may happen even when the patch is not applied.
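For reference, the behaviour the patch subject describes can be observed with a
small standalone check like the one below. This is only an illustration, not the
patch itself; it assumes x86-64 with a 2 MiB PMD size, and whether the region is
actually huge-page backed also depends on
/sys/kernel/mm/transparent_hugepage/enabled.

/*
 * Illustration only: map an 8 MiB anonymous region (the glibc default
 * pthread stack size), check whether the returned address is PMD
 * (2 MiB) aligned, then fault it in and print the non-zero
 * AnonHugePages counters from /proc/self/smaps.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ       (8UL << 20)
#define PMD_SIZE (2UL << 20)

int main(void)
{
	char line[256];
	char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	printf("addr %p, PMD aligned: %s\n", (void *)p,
	       ((unsigned long)p & (PMD_SIZE - 1)) == 0 ? "yes" : "no");

	memset(p, 0, SZ);	/* fault the whole region in */

	/* non-zero AnonHugePages entries mean THP actually backs a VMA */
	FILE *f = fopen("/proc/self/smaps", "r");
	while (f && fgets(line, sizeof(line), f))
		if (strstr(line, "AnonHugePages") && !strstr(line, " 0 kB"))
			fputs(line, stdout);
	if (f)
		fclose(f);
	return 0;
}

With the alignment patch applied, an 8 MiB anonymous mapping like this typically
comes back 2 MiB aligned and picks up huge pages once touched; without it the
start address is usually not PMD aligned.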
> > stress-ng.pthread test code is strange here:
> >
> > https://github.com/ColinIanKing/stress-ng/blob/master/stress-pthread.c#L573
> >
> > Even though it allocates its own stack, that attr is not passed to
> > pthread_create. So it's still glibc that allocates the stack for the
> > pthread, and that stack is 8M in size. This is why this patch can impact
> > the stress-ng.pthread testing.
>
> Aha, nice catch, I overlooked that.
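To make that concrete, the pattern boils down to something like the sketch below
(a reduced illustration, not the actual stress-ng code): the attr carrying the
caller-provided stack is set up, but pthread_create() never sees it.

/*
 * Reduced illustration: an attr with a small caller-provided stack is
 * prepared but never handed to pthread_create(), so glibc still mmap()s
 * its own default ~8 MiB stack for the thread.
 */
#include <pthread.h>
#include <stdlib.h>

static void *worker(void *arg)
{
	return NULL;
}

int main(void)
{
	pthread_attr_t attr;
	pthread_t tid;
	void *stack = NULL;

	posix_memalign(&stack, 4096, 16384);	/* small 16 KiB stack */
	pthread_attr_init(&attr);
	pthread_attr_setstack(&attr, stack, 16384);

	/*
	 * Passing NULL instead of &attr means the small stack above is
	 * never used; glibc allocates the default-sized stack via mmap().
	 */
	pthread_create(&tid, NULL, worker, NULL);
	pthread_join(tid, NULL);

	pthread_attr_destroy(&attr);
	free(stack);
	return 0;
}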
> >
> > My understanding is this is a different regression (if it's a valid
> > regression). The previous hotspot was in:
> >   deferred_split_huge_page
> >   deferred_split_huge_page
> >   deferred_split_huge_page
> >   spin_lock
> >
> > while this time, the hotspot is in (pmd_lock from do_madvise I suppose):
> >   - 55.02% zap_pmd_range.isra.0
> >      - 53.42% __split_huge_pmd
> >         - 51.74% _raw_spin_lock
> >            - 51.73% native_queued_spin_lock_slowpath
> >               + 3.03% asm_sysvec_call_function
> >         - 1.67% __split_huge_pmd_locked
> >            - 0.87% pmdp_invalidate
> >               + 0.86% flush_tlb_mm_range
> >      - 1.60% zap_pte_range
> >         - 1.04% page_remove_rmap
> >              0.55% __mod_lruvec_page_state
> >
> > >>
> > >> commit: 1111d46b5cbad57486e7a3fab75888accac2f072 ("mm: align larger anonymous mappings on THP boundaries")
> > >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >>
> > >> testcase: stress-ng
> > >> test machine: 36 threads 1 sockets Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz (Cascade Lake) with 128G memory
> > >> parameters:
> > >>
> > >>         nr_threads: 1
> > >>         disk: 1HDD
> > >>         testtime: 60s
> > >>         fs: ext4
> > >>         class: os
> > >>         test: pthread
> > >>         cpufreq_governor: performance
> > >>
> > >> In addition to that, the commit also has significant impact on the following tests:
> > >>
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >> | testcase: change | stream: stream.triad_bandwidth_MBps -12.1% regression                                         |
> > >> | test machine     | 224 threads 2 sockets Intel(R) Xeon(R) Platinum 8480CTDX (Sapphire Rapids) with 512G memory   |
> > >> | test parameters  | array_size=50000000                                                                            |
> > >> |                  | cpufreq_governor=performance                                                                   |
> > >> |                  | iterations=10x                                                                                 |
> > >> |                  | loop=100                                                                                       |
> > >> |                  | nr_threads=25%                                                                                 |
> > >> |                  | omp=true                                                                                       |
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.Integer.mb_s -3.5% regression       |
> > >> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
> > >> | test parameters  | cpufreq_governor=performance                                                                   |
> > >> |                  | option_a=Average                                                                               |
> > >> |                  | option_b=Integer                                                                               |
> > >> |                  | test=ramspeed-1.4.3                                                                            |
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >> | testcase: change | phoronix-test-suite: phoronix-test-suite.ramspeed.Average.FloatingPoint.mb_s -3.0% regression |
> > >> | test machine     | 12 threads 1 sockets Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz (Coffee Lake) with 16G memory    |
> > >> | test parameters  | cpufreq_governor=performance                                                                   |
> > >> |                  | option_a=Average                                                                               |
> > >> |                  | option_b=Floating Point                                                                        |
> > >> |                  | test=ramspeed-1.4.3                                                                            |
> > >> +------------------+------------------------------------------------------------------------------------------------+
> > >>
> > >> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > >> the same patch/commit), kindly add the following tags
> > >> | Reported-by: kernel test robot
> > >> | Closes: https://lore.kernel.org/oe-lkp/202312192310.56367035-oliver.sang@intel.com
> > >>
> > >> Details are as below:
> > >> -------------------------------------------------------------------------------------------------->
> > >>
> > >> The kernel config and materials to reproduce are available at:
> > >> https://download.01.org/0day-ci/archive/20231219/202312192310.56367035-oliver.sang@intel.com
> > >>
> > >> =========================================================================================
> > >> class/compiler/cpufreq_governor/disk/fs/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
> > >>   os/gcc-12/performance/1HDD/ext4/x86_64-rhel-8.3/1/debian-11.1-x86_64-20220510.cgz/lkp-csl-d02/pthread/stress-ng/60s
> > >>
> > >> commit:
> > >>   30749e6fbb ("mm/memory: replace kmap() with kmap_local_page()")
> > >>   1111d46b5c ("mm: align larger anonymous mappings on THP boundaries")
> > >>
> > >> 30749e6fbb3d391a 1111d46b5cbad57486e7a3fab75
> > >> ---------------- ---------------------------
> > >>          %stddev     %change         %stddev
> > >>              \          |                \
> > >>   13405796           -65.5%    4620124        cpuidle..usage
> > >>       8.00            +8.2%       8.66 ±  2%  iostat.cpu.system
> > >>       1.61           -60.6%       0.63        iostat.cpu.user
> > >>     597.50 ± 14%     -64.3%     213.50 ± 14%  perf-c2c.DRAM.local
> > >>       1882 ± 14%     -74.7%     476.83 ±  7%  perf-c2c.HITM.local
> > >>    3768436           -12.9%    3283395        vmstat.memory.cache
> > >>     355105           -75.7%      86344 ±  3%  vmstat.system.cs
> > >>     385435           -20.7%     305714 ±  3%  vmstat.system.in
> > >>       1.13            -0.2        0.88        mpstat.cpu.all.irq%
> > >>       0.29            -0.2        0.10 ±  2%  mpstat.cpu.all.soft%
> > >>       6.76 ±  2%      +1.1        7.88 ±  2%  mpstat.cpu.all.sys%
> > >>       1.62            -1.0        0.62 ±  2%  mpstat.cpu.all.usr%
> > >>    2234397           -84.3%     350161 ±  5%  stress-ng.pthread.ops
> > >>      37237           -84.3%       5834 ±  5%  stress-ng.pthread.ops_per_sec
> > >>     294706 ±  2%     -68.0%      94191 ±  6%  stress-ng.time.involuntary_context_switches
> > >>      41442 ±  2%   +5023.4%    2123284        stress-ng.time.maximum_resident_set_size
> > >>    4466457           -83.9%     717053 ±  5%  stress-ng.time.minor_page_faults
> > >
> > > The larger RSS and fewer page faults are expected.
> > >
> > >>     243.33           +13.5%     276.17 ±  3%  stress-ng.time.percent_of_cpu_this_job_got
> > >>     131.64           +27.7%     168.11 ±  3%  stress-ng.time.system_time
> > >>      19.73           -82.1%       3.53 ±  4%  stress-ng.time.user_time
> > >
> > > Much less user time. And it seems to match the drop of the pthread metric.
> > >
> > >>    7715609           -80.2%    1530125 ±  4%  stress-ng.time.voluntary_context_switches
> > >>      76728           -80.8%      14724 ±  4%  perf-stat.i.minor-faults
> > >>    5600408           -61.4%    2160997 ±  5%  perf-stat.i.node-loads
> > >>    8873996           +52.1%   13499744 ±  5%  perf-stat.i.node-stores
> > >>     112409           -81.9%      20305 ±  4%  perf-stat.i.page-faults
> > >>       2.55           +89.6%       4.83        perf-stat.overall.MPKI
> > >
> > > Much more TLB misses.
> > >
> > >>       1.51            -0.4        1.13        perf-stat.overall.branch-miss-rate%
> > >>      19.26           +24.5       43.71        perf-stat.overall.cache-miss-rate%
> > >>       1.70           +56.4%       2.65        perf-stat.overall.cpi
> > >>     665.84           -17.5%     549.51 ±  2%  perf-stat.overall.cycles-between-cache-misses
> > >>       0.12 ±  4%      -0.1        0.04        perf-stat.overall.dTLB-load-miss-rate%
> > >>       0.08 ±  2%      -0.0        0.03        perf-stat.overall.dTLB-store-miss-rate%
> > >>      59.16            +0.9       60.04        perf-stat.overall.iTLB-load-miss-rate%
> > >>       1278           +86.1%       2379 ±  2%  perf-stat.overall.instructions-per-iTLB-miss
> > >>       0.59           -36.1%       0.38        perf-stat.overall.ipc
> > >
> > > Worse IPC and CPI.
> > >
> > >>  2.078e+09           -48.3%  1.074e+09 ±  4%  perf-stat.ps.branch-instructions
> > >>   31292687           -61.2%   12133349 ±  2%  perf-stat.ps.branch-misses
> > >>   26057291            -5.9%   24512034 ±  4%  perf-stat.ps.cache-misses
> > >>  1.353e+08           -58.6%   56072195 ±  4%  perf-stat.ps.cache-references
> > >>     365254           -75.8%      88464 ±  3%  perf-stat.ps.context-switches
> > >>  1.735e+10           -22.4%  1.346e+10 ±  2%  perf-stat.ps.cpu-cycles
> > >>      60838           -79.1%      12727 ±  6%  perf-stat.ps.cpu-migrations
> > >>    3056601 ±  4%     -81.5%     565354 ±  4%  perf-stat.ps.dTLB-load-misses
> > >>  2.636e+09           -50.7%    1.3e+09 ±  4%  perf-stat.ps.dTLB-loads
> > >>    1155253 ±  2%     -83.0%     196581 ±  5%  perf-stat.ps.dTLB-store-misses
> > >>  1.473e+09           -57.4%  6.268e+08 ±  3%  perf-stat.ps.dTLB-stores
> > >>    7997726           -73.3%    2131477 ±  3%  perf-stat.ps.iTLB-load-misses
> > >>    5521346           -74.3%    1418623 ±  2%  perf-stat.ps.iTLB-loads
> > >>  1.023e+10           -50.4%  5.073e+09 ±  4%  perf-stat.ps.instructions
> > >>      75671           -80.9%      14479 ±  4%  perf-stat.ps.minor-faults
> > >>    5549722           -61.4%    2141750 ±  4%  perf-stat.ps.node-loads
> > >>    8769156           +51.6%   13296579 ±  5%  perf-stat.ps.node-stores
> > >>     110795           -82.0%      19977 ±  4%  perf-stat.ps.page-faults
> > >>  6.482e+11           -50.7%  3.197e+11 ±  4%  perf-stat.total.instructions
> > >>       0.00 ± 37%    -100.0%       0.00        perf-sched.sch_delay.avg.ms.__cond_resched.__kmem_cache_alloc_node.__kmalloc_node.memcg_alloc_slab_cgroups.allocate_slab
> > >>       0.01 ± 18%   +8373.1%       0.73 ± 49%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.do_madvise.__x64_sys_madvise.do_syscall_64
> > >>       0.01 ± 16%   +4600.0%       0.38 ± 24%  perf-sched.sch_delay.avg.ms.__cond_resched.down_read.exit_mm.do_exit.__x64_sys_exit
> > >
> > > More time is spent in madvise and munmap, but I'm not sure whether this
> > > is caused by tearing down the address space when exiting the test. If
> > > so, it should not count toward the regression.
> >
> > It's not the whole address space being torn down. It's the pthread
> > stack being torn down when the pthread exits (can that be treated as
> > address space teardown? I suppose so).
> >
> > https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L384
> > https://github.com/lattera/glibc/blob/master/nptl/pthread_create.c#L576
>
> It explains the problem. The madvise() does have some extra overhead
> for handling THP (splitting pmd, deferred split queue, etc).
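As a standalone sketch of where that overhead comes from (assuming the glibc
stack-cache path amounts to an madvise(MADV_DONTNEED) over most of the stack at
thread exit), something like this exercises the same kernel path once the stack
VMA is THP-backed:

/*
 * Sketch only: once an 8 MiB stack VMA is THP-aligned and PMD-mapped,
 * a DONTNEED whose bounds cut through huge PMDs has to go through
 * __split_huge_pmd() under the PMD lock, which matches the new
 * hotspot above.
 */
#include <string.h>
#include <sys/mman.h>

#define STACK_SZ (8UL << 20)

int main(void)
{
	char *stk = mmap(NULL, STACK_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (stk == MAP_FAILED)
		return 1;

	memset(stk, 0, STACK_SZ);	/* fault in; THP-eligible when aligned */

	/*
	 * Drop most of the stack but keep the VMA around for reuse, the
	 * way a cached thread stack is handled; the page-granular bounds
	 * partially cover huge PMDs and force the split path.
	 */
	madvise(stk + 4096, STACK_SZ - 8192, MADV_DONTNEED);
	return 0;
}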
> > Another thing is whether it's worth it to make the stack use THP? It may
> > be useful for some apps which need a large stack size?
>
> Kernel actually doesn't apply THP to stacks (see
> vma_is_temporary_stack()). But the kernel can only tell whether a VMA is
> a stack by checking the VM_GROWSDOWN | VM_GROWSUP flags. So if glibc
> doesn't set the proper flags to tell the kernel the area is a stack, the
> kernel just treats it as a normal anonymous area. So glibc should set up
> the stack properly IMHO.

If I read the code correctly, nptl allocates the stack with the code below:

        mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

See https://github.com/lattera/glibc/blob/master/nptl/allocatestack.c#L563

MAP_STACK is used, but it is a no-op on Linux. So the alternative is to
make MAP_STACK useful on Linux instead of changing glibc. But the blast
radius seems much wider.
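As a sketch of the "teach the allocator" direction, and purely hypothetical (this
is not what glibc does today, and allocate_thread_stack() below is a made-up
helper, not the real nptl function): under the current ABI the allocator could opt
the region out of THP itself with MADV_NOHUGEPAGE right after the mmap.

/*
 * Hypothetical variant of the nptl allocation above, not current glibc
 * behaviour: since MAP_STACK is a no-op and the kernel cannot infer
 * "this is a stack" from the mmap flags, the allocator itself could
 * opt the region out of THP with MADV_NOHUGEPAGE right after the mmap.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *allocate_thread_stack(size_t size, size_t guardsize)
{
	int prot = PROT_READ | PROT_WRITE;
	void *mem = mmap(NULL, size, guardsize == 0 ? prot : PROT_NONE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

	if (mem == MAP_FAILED)
		return NULL;

	/* tell khugepaged and the fault path to skip this VMA */
	madvise(mem, size, MADV_NOHUGEPAGE);
	return mem;
}

That keeps the change local to the stack mapping, at the cost of never getting THP
there even for workloads that might actually want it.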
> >
> > Regards
> > Yin, Fengwei