From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <51d48776-ac72-432a-b768-92e7fa0ecd4b@huawei.com>
Date: Thu, 9 May 2024 09:47:12 +0800
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: Yang Shi
CC: Ryan Roberts, David Hildenbrand, Matthew Wilcox, Yang Shi,
 linux-mm@kvack.org, Ze Zuo
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"; format=flowed
References: <20231214223423.1133074-1-yang@os.amperecomputing.com>
 <1dc9a561-55f7-4d65-8b86-8a40fa0e84f9@arm.com>
 <6016c0e9-b567-4205-8368-1f1c76184a28@huawei.com>
 <2c14d9ad-c5a3-4f29-a6eb-633cdf3a5e9e@redhat.com>
 <2b403705-a03c-4cfe-8d95-b38dd83fca52@arm.com>
 <281aebf1-0bff-4858-b479-866eb05b9e94@huawei.com>
 <219cb8e3-a77b-468b-9d69-0b3e386f93f6@arm.com>
 <7d8c43b6-b1ef-428e-9d6a-1c26284feb26@huawei.com>

On 2024/5/8 23:25, Yang Shi
wrote:
> On Wed, May 8, 2024 at 6:37 AM Kefeng Wang wrote:
>>
>> On 2024/5/8 16:36, Ryan Roberts wrote:
>>> On 08/05/2024 08:48, Kefeng Wang wrote:
>>>>
>>>> On 2024/5/8 1:17, Yang Shi wrote:
>>>>> On Tue, May 7, 2024 at 8:53 AM Ryan Roberts wrote:
>>>>>>
>>>>>> On 07/05/2024 14:53, Kefeng Wang wrote:
>>>>>>>
>>>>>>> On 2024/5/7 19:13, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>>> https://github.com/intel/lmbench/blob/master/src/lat_mem_rd.c#L95
>>>>>>>>>
>>>>>>>>>> suggest. If you want to try something semi-randomly, it might be
>>>>>>>>>> useful to rule out the arm64 contpte feature. I don't see how that
>>>>>>>>>> would be interacting here if mTHP is disabled (is it?). But it's new
>>>>>>>>>> for 6.9 and arm64-only. Disable with ARM64_CONTPTE (needs EXPERT)
>>>>>>>>>> at compile time.
>>>>>>>>> I haven't enabled mTHP, so it should not be related to ARM64_CONTPTE,
>>>>>>>>> but I will give it a try.
>>>>>>>
>>>>>>> With ARM64_CONTPTE disabled, memory read latency is similar to
>>>>>>> ARM64_CONTPTE enabled (default 6.9-rc7), still larger than with the
>>>>>>> anon alignment patch reverted.
>>>>>>
>>>>>> OK, thanks for trying.
>>>>>>
>>>>>> Looking at the source for lmbench, it mallocs (512M + 8K) up front and
>>>>>> uses that for all sizes. That will presumably be considered "large" by
>>>>>> malloc and will be allocated using mmap. So with the patch, it will be
>>>>>> 2M-aligned. Without it, it probably won't. I'm still struggling to
>>>>>> understand why not aligning it in virtual space would make it more
>>>>>> performant though...
>>>>>
>>>>> Yeah, I'm confused too.
>>>>
>>>> Me too. I grabbed smaps[_rollup] for the 0.09375M size; the biggest
>>>> difference for anon is shown below, and the full output is attached.
>>>
>>> OK, a bit more insight; during initialization, the test makes 2 big
>>> malloc calls; the first is 1M and the second is 512M+8K. I think those
>>> 2 are the 2 vmas below (malloc is adding an extra page to the
>>> allocation, presumably for management structures).
>>>
>>> With efa7df3e3bb5 applied, the 1M allocation is allocated at a
>>> non-THP-aligned address. All of its pages are populated (see
>>> permutation(), which allocates and writes it) but none of them are THP
>>> (obviously - it's only 1M and THP is only enabled for 2M). But the 512M
>>> region is allocated at a THP-aligned address. And the first page is
>>> populated with a THP (presumably faulted when malloc writes to its
>>> control structure page before the application even sees the allocated
>>> buffer).
>>>
>>> In contrast, when efa7df3e3bb5 is reverted, neither of the vmas is
>>> THP-aligned, and therefore the 512M region abuts the 1M region and the
>>> vmas are merged in the kernel. So we end up with the single 525328 kB
>>> region. There are no THPs allocated here (due to alignment constraints)
>>> so we end up with the 1M region fully populated with 4K pages as
>>> before, and only the malloc control page plus the parts of the buffer
>>> that the application actually touches being populated in the 512M
>>> region.
>>>
>>> As far as I can tell, the application never touches the 1M region
>>> during the test, so it should be cache-cold. It only touches the first
>>> part of the 512M buffer it needs for the size of the test (96K here?).
>>> The latency of allocating the THP will have been consumed during test
>>> setup, so I doubt we are seeing that in the test results, and I don't
>>> see why having a single TLB entry vs 96K/4K=24 entries would make it
>>> slower.
>>
>> It is strange, and even stranger, I got another machine (the old
>> machine has 128 cores and the new one 96 cores, but with the same
>> L1/L2 cache size per core), and the new machine doesn't show this
>> issue. We will contact our hardware team, maybe there are some
>> different configurations (prefetch or some other similar hardware
>> settings). Thanks for all the suggestions and analysis!
>
> Yes, the benchmark result strongly relies on the cache and memory
> subsystem. See the below analysis.
>
>>
>>>
>>> It would be interesting to know the address that gets returned from
>>> malloc for the 512M region, if that's possible to get (in both cases)?
>>> I guess it is offset into the first page. Perhaps it is offset such
>>> that with the THP alignment case the 96K of interest ends up
>>> straddling 3 cache lines (cache line is 64K I assume?), but for the
>>> unaligned case, it ends up nicely packed in 2?
>>
>> CC zuoze, please help to check this.
>>
>> Thanks again.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>> 1) with efa7df3e3bb5 smaps
>>>>
>>>> ffff68e00000-ffff88e03000 rw-p 00000000 00:00 0
>>>> Size:           524300 kB
>>>> KernelPageSize:      4 kB
>>>> MMUPageSize:         4 kB
>>>> Rss:              2048 kB
>>>> Pss:              2048 kB
>>>> Pss_Dirty:        2048 kB
>>>> Shared_Clean:        0 kB
>>>> Shared_Dirty:        0 kB
>>>> Private_Clean:       0 kB
>>>> Private_Dirty:    2048 kB
>>>> Referenced:       2048 kB
>>>> Anonymous:        2048 kB // we have 1 anon thp
>>>> KSM:                 0 kB
>>>> LazyFree:            0 kB
>>>> AnonHugePages:    2048 kB
>>>
>>> Yes, one 2M THP shown here.
>
> You have THP allocated. W/o commit efa7df3e3bb5 the address may not be
> PMD-aligned (it still could be, just not that likely), so base pages
> were allocated. To get an apples-to-apples comparison, you need to
> disable THP by setting /sys/kernel/mm/transparent_hugepage/enabled to
> madvise or never; then you will get base pages too (IIRC lmbench
> doesn't call MADV_HUGEPAGE).

Yes, we tested with THP disabled (via sysfs) before; there was no
difference w/ or w/o efa7df3e3bb5.

>
> The address alignment or page size may have a negative impact on your
> CPU's cache and memory subsystem, for example, the hw prefetcher. But I
> saw a slight improvement with THP on my machine. So the behavior
> strongly depends on the hardware.
>

I hoped efa7df3e3bb5 would improve performance, so I backported it into
our kernel, but then hit the issue above, and saw the same result when
retesting with 6.9-rc7. Since different hardware shows different
results, we will test more machines and contact our hardware team.
Thanks for your help.