Message-ID: <42733616-5f8f-47ce-a861-b00701069221@arm.com>
Date: Wed, 8 May 2024 14:41:54 +0100
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries
To: Kefeng Wang, Yang Shi
Cc: David Hildenbrand, Matthew Wilcox, Yang Shi, riel@surriel.com,
 cl@linux.com, akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Ze Zuo
References: <20231214223423.1133074-1-yang@os.amperecomputing.com>
 <1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.com>
 <1dc9a561-55f7-4d65-8b86-8a40fa0e84f9@arm.com>
 <6016c0e9-b567-4205-8368-1f1c76184a28@huawei.com>
 <2c14d9ad-c5a3-4f29-a6eb-633cdf3a5e9e@redhat.com>
 <2b403705-a03c-4cfe-8d95-b38dd83fca52@arm.com>
 <281aebf1-0bff-4858-b479-866eb05b9e94@huawei.com>
 <219cb8e3-a77b-468b-9d69-0b3e386f93f6@arm.com>
 <7d8c43b6-b1ef-428e-9d6a-1c26284feb26@huawei.com>
From: Ryan Roberts
In-Reply-To:
<7d8c43b6-b1ef-428e-9d6a-1c26284feb26@huawei.com>

On 08/05/2024 14:37, Kefeng Wang wrote:
>
>
> On 2024/5/8 16:36, Ryan Roberts wrote:
>> On 08/05/2024 08:48, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/5/8 1:17, Yang Shi wrote:
>>>> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts wrote:
>>>>>
>>>>> On 07/05/2024 14:53, Kefeng Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/5/7 19:13, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>>>>>>
>>>>>>>>> suggest. If you want to try something semi-randomly, it might be
>>>>>>>>> useful to rule out the arm64 contpte feature. I don't see how that
>>>>>>>>> would be interacting here if mTHP is disabled (is it?). But it's new
>>>>>>>>> for 6.9 and arm64 only. Disable with ARM64_CONTPTE (needs EXPERT) at
>>>>>>>>> compile time.
>>>>>>>> I haven't enabled mTHP, so it should be unrelated to ARM64_CONTPTE,
>>>>>>>> but I will give it a try.
>>>>>>
>>>>>> With ARM64_CONTPTE disabled, memory read latency is similar to that
>>>>>> with ARM64_CONTPTE enabled (the default in 6.9-rc7), and still higher
>>>>>> than with the anon alignment patch reverted.
>>>>>
>>>>> OK thanks for trying.
>>>>>
>>>>> Looking at the source for lmbench, it's malloc'ing (512M + 8K) up front
>>>>> and using that for all sizes. That will presumably be considered "large"
>>>>> by malloc and will be allocated using mmap. So with the patch, it will
>>>>> be 2M aligned. Without it, it probably won't. I'm still struggling to
>>>>> understand why not aligning it in virtual space would make it more
>>>>> performant though...
>>>>
>>>> Yeah, I'm confused too.
>>> Me too. I captured smaps[_rollup] for the 0.09375M size; the biggest
>>> differences for anon are shown below, and the full output is attached.
>>
>> OK, a bit more insight; during initialization, the test makes 2 big malloc
>> calls; the first is 1M and the second is 512M+8K. I think those 2 are the 2
>> vmas below (malloc is adding an extra page to the allocation, presumably for
>> management structures).
>>
>> With efa7df3e3bb5 applied, the 1M allocation is allocated at a
>> non-THP-aligned address. All of its pages are populated (see permutation()
>> which allocates and writes it) but none of them are THP (obviously - it's
>> only 1M and THP is only enabled for 2M). But the 512M region is allocated at
>> a THP-aligned address. And the first page is populated with a THP
>> (presumably faulted when malloc writes to its control structure page before
>> the application even sees the allocated buffer).
>>
>> In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is
>> THP-aligned, and therefore the 512M region abuts the 1M region and the vmas
>> are merged in the kernel. So we end up with the single 525328 kB region.
>> There are no THPs allocated here (due to alignment constraints) so we end up
>> with the 1M region fully populated with 4K pages as before, and only the
>> malloc control page plus the parts of the buffer that the application
>> actually touches being populated in the 512M region.
>>
>> As far as I can tell, the application never touches the 1M region during the
>> test so it should be cache-cold. It only touches the first part of the 512M
>> buffer it needs for the size of the test (96K here?). The latency of
>> allocating the THP will have been consumed during test setup so I doubt we
>> are seeing that in the test results and I don't see why having a single TLB
>> entry vs 96K/4K=24 entries would make it slower.
>
> It is strange, and stranger still: I tried another machine (the old machine
> has 128 cores and the new one 96, with the same per-core L1/L2 cache sizes),
> and the new machine does not show this issue. I will contact our hardware
> team; maybe there is some configuration difference (prefetch or a similar
> hardware setting). Thanks for all the suggestions and analysis!

No problem, you're welcome!

>
>
>>
>> It would be interesting to know the address that gets returned from malloc
>> for the 512M region if that's possible to get (in both cases)? I guess it is
>> offset into the first page. Perhaps it is offset such that with the THP
>> alignment case the 96K of interest ends up straddling 3 cache lines (cache
>> line is 64K I assume?), but for the unaligned case, it ends up nicely packed
>> in 2?
>
> CC zuoze, please help to check this.
>
> Thanks again.
>>
>> Thanks,
>> Ryan
>>
>>>
>>> 1) with efa7df3e3bb5 smaps
>>>
>>> ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
>>> Size:             524300 kB
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Rss:                2048 kB
>>> Pss:                2048 kB
>>> Pss_Dirty:          2048 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      2048 kB
>>> Referenced:         2048 kB
>>> Anonymous:          2048 kB // we have 1 anon thp
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:      2048 kB
>>
>> Yes, one 2M THP shown here.
>>
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           1
>>> VmFlags: rd wr mr mw me ac
>>> ffff88eff000-ffff89000000 rw-p 00000000 00:00 0
>>> Size:               1028 kB
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Rss:                1028 kB
>>> Pss:                1028 kB
>>> Pss_Dirty:          1028 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      1028 kB
>>> Referenced:         1028 kB
>>> Anonymous:          1028 kB // another large anon
>>
>> This is not THP, since you only have 2M THP enabled. This will be 1M of 4K
>> page allocations + 1 4K page malloc control structure, allocated and
>> accessed by permutation() during test setup.
>>
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:         0 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           0
>>> VmFlags: rd wr mr mw me ac
>>>
>>> and the smaps_rollup
>>>
>>> 00400000-fffff56bd000 ---p 00000000 00:00 0 [rollup]
>>> Rss:                4724 kB
>>> Pss:                3408 kB
>>> Pss_Dirty:          3338 kB
>>> Pss_Anon:           3338 kB
>>> Pss_File:             70 kB
>>> Pss_Shmem:             0 kB
>>> Shared_Clean:       1176 kB
>>> Shared_Dirty:        420 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      3128 kB
>>> Referenced:         4344 kB
>>> Anonymous:          3548 kB
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:      2048 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>>
>>> 2) without efa7df3e3bb5 smaps
>>>
>>> ffff9845b000-ffffb855f000 rw-p 00000000 00:00 0
>>> Size:             525328 kB
>>
>> This is a merged-vma version of the above 2 regions.
>>
>>> KernelPageSize:        4 kB
>>> MMUPageSize:           4 kB
>>> Rss:                1128 kB
>>> Pss:                1128 kB
>>> Pss_Dirty:          1128 kB
>>> Shared_Clean:          0 kB
>>> Shared_Dirty:          0 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      1128 kB
>>> Referenced:         1128 kB
>>> Anonymous:          1128 kB // only large anon
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:         0 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>> THPeligible:           1
>>> VmFlags: rd wr mr mw me ac
>>>
>>> and the smaps_rollup,
>>>
>>> 00400000-ffffca5dc000 ---p 00000000 00:00 0 [rollup]
>>> Rss:                2600 kB
>>> Pss:                1472 kB
>>> Pss_Dirty:          1388 kB
>>> Pss_Anon:           1388 kB
>>> Pss_File:             84 kB
>>> Pss_Shmem:             0 kB
>>> Shared_Clean:       1000 kB
>>> Shared_Dirty:        424 kB
>>> Private_Clean:         0 kB
>>> Private_Dirty:      1176 kB
>>> Referenced:         2220 kB
>>> Anonymous:          1600 kB
>>> KSM:                   0 kB
>>> LazyFree:              0 kB
>>> AnonHugePages:         0 kB
>>> ShmemPmdMapped:        0 kB
>>> FilePmdMapped:         0 kB
>>> Shared_Hugetlb:        0 kB
>>> Private_Hugetlb:       0 kB
>>> Swap:                  0 kB
>>> SwapPss:               0 kB
>>> Locked:                0 kB
>>>
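
For anyone wanting to sanity-check the vma reasoning in this thread, here is a minimal sketch that recomputes the sizes and start-address alignment from the address ranges quoted in the smaps output. It is only an illustration: the helper name describe() is made up for this example, and it assumes 2M PMD-sized THP and 4K base pages, as discussed in the thread.

```python
THP_SIZE = 2 * 1024 * 1024  # assumption: 2M PMD-sized THP, as stated in the thread
PAGE_SIZE = 4 * 1024        # 4K base pages, per KernelPageSize in the smaps output


def describe(vma_range):
    """Return (size in kB, whether the start address is THP-aligned)."""
    start_hex, end_hex = vma_range.split("-")
    start, end = int(start_hex, 16), int(end_hex, 16)
    return (end - start) // 1024, start % THP_SIZE == 0


# VMA ranges copied from the smaps output quoted in this thread.
patched_512m = describe("ffff68e00000-ffff88e03000")
patched_1m = describe("ffff88eff000-ffff89000000")
reverted = describe("ffff9845b000-ffffb855f000")

for name, (size_kb, aligned) in [
    ("patched 512M vma", patched_512m),
    ("patched 1M vma", patched_1m),
    ("reverted merged vma", reverted),
]:
    print(f"{name}: {size_kb} kB, THP-aligned start: {aligned}")

# Number of 4K pages (and hence 4K TLB entries) covering the 96K the test touches.
print("4K pages for 96K:", (96 * 1024) // PAGE_SIZE)
```

With the patch, only the 512M vma starts on a 2M boundary. The two patched vma sizes (524300 kB + 1028 kB) add up to the 525328 kB of the single vma seen with the patch reverted, which seems consistent with the two regions having been merged.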