From: Yang Shi <shy828301@gmail.com>
Date: Wed, 8 May 2024 08:25:25 -0700
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries
To: Kefeng Wang
Cc: Ryan Roberts, David Hildenbrand, Matthew Wilcox, Yang Shi, riel@surriel.com, cl@linux.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Ze Zuo

On Wed, May 8, 2024 at 6:37 AM Kefeng Wang wrote:
>
> On 2024/5/8 16:36, Ryan Roberts wrote:
> > On 08/05/2024 08:48, Kefeng Wang wrote:
> >>
> >> On 2024/5/8 1:17, Yang Shi wrote:
> >>> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts wrote:
> >>>>
> >>>> On 07/05/2024 14:53, Kefeng Wang wrote:
> >>>>>
> >>>>> On 2024/5/7 19:13, David Hildenbrand wrote:
> >>>>>>
> >>>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
> >>>>>>>
> >>>>>>>> suggest. If you want to try something semi-randomly; it might be useful
> >>>>>>>> to rule out the arm64 contpte feature. I don't see how that would be
> >>>>>>>> interacting here if mTHP is disabled (is it?). But it's new for 6.9 and
> >>>>>>>> arm64 only. Disable with ARM64_CONTPTE (needs EXPERT) at compile time.
> >>>>>>> I haven't enabled mTHP, so it should not be related to ARM64_CONTPTE,
> >>>>>>> but I will give it a try.
> >>>>>
> >>>>> With ARM64_CONTPTE disabled, memory read latency is similar to the
> >>>>> ARM64_CONTPTE-enabled case (default 6.9-rc7), and still larger than with
> >>>>> the anon alignment commit reverted.
> >>>>
> >>>> OK, thanks for trying.
> >>>>
> >>>> Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front and
> >>>> using that for all sizes. That will presumably be considered "large" by
> >>>> malloc and will be allocated using mmap. So with the patch, it will be 2M
> >>>> aligned. Without it, it probably won't. I'm still struggling to understand
> >>>> why not aligning it in virtual space would make it more performant though...
> >>>
> >>> Yeah, I'm confused too.
> >> Me too. I got smaps[_rollup] for the 0.09375M size; the biggest difference
> >> for anon is shown below, and everything is attached.
> >
> > OK, a bit more insight; during initialization, the test makes 2 big malloc
> > calls; the first is 1M and the second is 512M+8K. I think those 2 are the 2
> > vmas below (malloc is adding an extra page to the allocation, presumably for
> > management structures).
> >
> > With efa7df3e3bb5 applied, the 1M allocation is placed at a non-THP-aligned
> > address. All of its pages are populated (see permutation(), which allocates
> > and writes it) but none of them are THP (obviously - it's only 1M and THP is
> > only enabled for 2M). But the 512M region is allocated at a THP-aligned
> > address, and the first page is populated with a THP (presumably faulted in
> > when malloc writes to its control structure page, before the application even
> > sees the allocated buffer).
> >
> > In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is THP-aligned,
> > and therefore the 512M region abuts the 1M region and the vmas are merged in
> > the kernel. So we end up with the single 525328 kB region. There are no THPs
> > allocated here (due to alignment constraints), so we end up with the 1M region
> > fully populated with 4K pages as before, and only the malloc control page plus
> > the parts of the buffer that the application actually touches being populated
> > in the 512M region.
> >
> > As far as I can tell, the application never touches the 1M region during the
> > test, so it should be cache-cold. It only touches the first part of the 512M
> > buffer that it needs for the size of the test (96K here?). The latency of
> > allocating the THP will have been consumed during test setup, so I doubt we
> > are seeing that in the test results, and I don't see why having a single TLB
> > entry vs 96K/4K=24 entries would make it slower.
>
> It is strange, and even stranger: I tried another machine (the old machine has
> 128 cores and the new machine 96 cores, but with the same L1/L2 cache size per
> core), and the new machine does not show this issue. We will contact our
> hardware team; maybe some configuration differs (prefetch or some other similar
> hardware setting). Thanks for all the suggestions and analysis!

Yes, the benchmark result strongly relies on the cache and memory subsystem.
See the analysis below.

> >
> > It would be interesting to know the address that gets returned from malloc for
> > the 512M region if that's possible to get (in both cases)? I guess it is offset
> > into the first page. Perhaps it is offset such that with the THP alignment case
> > the 96K of interest ends up straddling 3 cache lines (cache line is 64K I
> > assume?), but for the unaligned case, it ends up nicely packed in 2?
>
> CC zuoze, please help to check this.
>
> Thanks again.
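If it helps, here is a quick, untested sketch for grabbing that address: it just
repeats the two big allocations with the sizes discussed above and prints each
returned pointer together with its offset into a 2M (PMD-sized) region and into
a 64K chunk (the granularity assumed above). The sizes and the 64K figure are
taken from this thread, not from the lmbench source.

#include <stdio.h>
#include <stdlib.h>

#define PMD_SIZE	(2UL * 1024 * 1024)	/* THP size discussed above */
#define CHUNK_SIZE	(64UL * 1024)		/* 64K granularity assumed above */

static void report(const char *name, void *p)
{
	unsigned long a = (unsigned long)p;

	printf("%-10s %p  offset in 2M: 0x%06lx  offset in 64K: 0x%05lx\n",
	       name, p, a & (PMD_SIZE - 1), a & (CHUNK_SIZE - 1));
}

int main(void)
{
	/* Mirror the two big allocations described above. */
	void *small = malloc(1024UL * 1024);
	void *big = malloc(512UL * 1024 * 1024 + 8192);

	if (!small || !big)
		return 1;

	report("1M:", small);
	report("512M+8K:", big);
	return 0;
}

Running it on kernels with and without efa7df3e3bb5 should show whether the
offset into the first page actually differs between the aligned and unaligned
cases.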
> >
> > Thanks,
> > Ryan
> >
> >>
> >> 1) with efa7df3e3bb5 smaps
> >>
> >> ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
> >> Size:             524300 kB
> >> KernelPageSize:        4 kB
> >> MMUPageSize:           4 kB
> >> Rss:                2048 kB
> >> Pss:                2048 kB
> >> Pss_Dirty:          2048 kB
> >> Shared_Clean:          0 kB
> >> Shared_Dirty:          0 kB
> >> Private_Clean:         0 kB
> >> Private_Dirty:      2048 kB
> >> Referenced:         2048 kB
> >> Anonymous:          2048 kB   // we have 1 anon thp
> >> KSM:                   0 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:      2048 kB
> >
> > Yes, one 2M THP shown here.

You have THP allocated. Without commit efa7df3e3bb5 the address may not be PMD
aligned (it still could be, just not that likely), so base pages were allocated
instead. To get an apples-to-apples comparison, you need to disable THP by
setting /sys/kernel/mm/transparent_hugepage/enabled to madvise or never; then
you will get base pages too (IIRC lmbench doesn't call MADV_HUGEPAGE).

The address alignment or page size may have a negative impact on your CPU's
cache and memory subsystem, for example the hardware prefetcher. But I saw a
slight improvement with THP on my machine, so the behavior strongly depends on
the hardware.
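To confirm what each kernel actually gives you, a rough (and untested) helper
along these lines could be called from inside the benchmark: it looks up the
VMA containing a buffer in /proc/self/smaps and prints that mapping's
AnonHugePages line, so you can see THP vs base pages in both configurations.
The 512M+8K size is just the figure from this thread.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the VMA header and AnonHugePages line for the mapping holding addr. */
static void smaps_for(void *addr)
{
	unsigned long target = (unsigned long)addr, start, end;
	char line[512];
	int in_vma = 0;
	FILE *f = fopen("/proc/self/smaps", "r");

	if (!f) {
		perror("fopen");
		return;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
			/* Start of a new VMA entry. */
			in_vma = (target >= start && target < end);
			if (in_vma)
				fputs(line, stdout);
		} else if (in_vma && !strncmp(line, "AnonHugePages:", 14)) {
			fputs(line, stdout);
		}
	}
	fclose(f);
}

int main(void)
{
	char *buf = malloc(512UL * 1024 * 1024 + 8192);

	if (!buf)
		return 1;
	buf[0] = 1;	/* fault in the first page, like the allocator's own
			 * bookkeeping write would */
	smaps_for(buf);
	free(buf);
	return 0;
}

With /sys/kernel/mm/transparent_hugepage/enabled set to never (or madvise), the
AnonHugePages line should read 0 kB on both kernels, which is the
apples-to-apples setup described above.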
> >
> >> ShmemPmdMapped:        0 kB
> >> FilePmdMapped:         0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >> THPeligible:           1
> >> VmFlags: rd wr mr mw me ac
> >> ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
> >> Size:               1028 kB
> >> KernelPageSize:        4 kB
> >> MMUPageSize:           4 kB
> >> Rss:                1028 kB
> >> Pss:                1028 kB
> >> Pss_Dirty:          1028 kB
> >> Shared_Clean:          0 kB
> >> Shared_Dirty:          0 kB
> >> Private_Clean:         0 kB
> >> Private_Dirty:      1028 kB
> >> Referenced:         1028 kB
> >> Anonymous:          1028 kB   // another large anon
> >
> > This is not THP, since you only have 2M THP enabled. This will be 1M of 4K
> > page allocations + 1 4K page malloc control structure, allocated and accessed
> > by permutation() during test setup.
> >
> >> KSM:                   0 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:         0 kB
> >> ShmemPmdMapped:        0 kB
> >> FilePmdMapped:         0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >> THPeligible:           0
> >> VmFlags: rd wr mr mw me ac
> >>
> >> and the smaps_rollup
> >>
> >> 00400000-fffff56bd000 ---p 00000000 00:00 0          [rollup]
> >> Rss:                4724 kB
> >> Pss:                3408 kB
> >> Pss_Dirty:          3338 kB
> >> Pss_Anon:           3338 kB
> >> Pss_File:             70 kB
> >> Pss_Shmem:             0 kB
> >> Shared_Clean:       1176 kB
> >> Shared_Dirty:        420 kB
> >> Private_Clean:         0 kB
> >> Private_Dirty:      3128 kB
> >> Referenced:         4344 kB
> >> Anonymous:          3548 kB
> >> KSM:                   0 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:      2048 kB
> >> ShmemPmdMapped:        0 kB
> >> FilePmdMapped:         0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >>
> >> 2) without efa7df3e3bb5 smaps
> >>
> >> ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
> >> Size:             525328 kB
> >
> > This is a merged-vma version of the above 2 regions.
> >
> >> KernelPageSize:        4 kB
> >> MMUPageSize:           4 kB
> >> Rss:                1128 kB
> >> Pss:                1128 kB
> >> Pss_Dirty:          1128 kB
> >> Shared_Clean:          0 kB
> >> Shared_Dirty:          0 kB
> >> Private_Clean:         0 kB
> >> Private_Dirty:      1128 kB
> >> Referenced:         1128 kB
> >> Anonymous:          1128 kB   // only large anon
> >> KSM:                   0 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:         0 kB
> >> ShmemPmdMapped:        0 kB
> >> FilePmdMapped:         0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >> THPeligible:           1
> >> VmFlags: rd wr mr mw me ac
> >>
> >> and the smaps_rollup,
> >>
> >> 00400000-ffffca5dc000 ---p 00000000 00:00 0          [rollup]
> >> Rss:                2600 kB
> >> Pss:                1472 kB
> >> Pss_Dirty:          1388 kB
> >> Pss_Anon:           1388 kB
> >> Pss_File:             84 kB
> >> Pss_Shmem:             0 kB
> >> Shared_Clean:       1000 kB
> >> Shared_Dirty:        424 kB
> >> Private_Clean:         0 kB
> >> Private_Dirty:      1176 kB
> >> Referenced:         2220 kB
> >> Anonymous:          1600 kB
> >> KSM:                   0 kB
> >> LazyFree:              0 kB
> >> AnonHugePages:         0 kB
> >> ShmemPmdMapped:        0 kB
> >> FilePmdMapped:         0 kB
> >> Shared_Hugetlb:        0 kB
> >> Private_Hugetlb:       0 kB
> >> Swap:                  0 kB
> >> SwapPss:               0 kB
> >> Locked:                0 kB
> >>