From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 7 May 2024 16:25:19 +0800
From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: Re: [RESEND PATCH] mm: align larger anonymous mappings on THP boundaries
To: Ryan Roberts, Yang Shi
Cc: Matthew Wilcox, Yang Shi, linux-mm@kvack.org, Ze Zuo
User-Agent: Mozilla Thunderbird
References:
 <20231214223423.1133074-1-yang@os.amperecomputing.com>
 <1e8f5ac7-54ce-433a-ae53-81522b2320e1@arm.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit

Hi Ryan, Yang and all,

We see another regression on arm64 (there is no issue on x86) when
testing memory latency with lmbench:

  ./lat_mem_rd -P 1 512M 128

Memory latency (smaller is better):

  MiB        6.9-rc7    6.9-rc7+revert
  0.00049    1.539      1.539
  0.00098    1.539      1.539
  0.00195    1.539      1.539
  0.00293    1.539      1.539
  0.00391    1.539      1.539
  0.00586    1.539      1.539
  0.00781    1.539      1.539
  0.01172    1.539      1.539
  0.01562    1.539      1.539
  0.02344    1.539      1.539
  0.03125    1.539      1.539
  0.04688    1.539      1.539
  0.0625     1.540      1.540
  0.09375    3.634      3.086
  0.125      3.874      3.175
  0.1875     3.544      3.288
  0.25       3.556      3.461
  0.375      3.641      3.644
  0.5        4.125      3.851
  0.75       4.968      4.323
  1          5.143      4.686
  1.5        5.309      4.957
  2          5.370      5.116
  3          5.430      5.471
  4          5.457      5.671
  6          6.100      6.170
  8          6.496      6.468
  ------------------------------------------------------
  * The L2 cache is 8M: there is no big change below   *
  * 8M, but the latency drops a lot beyond L2 when     *
  * this patch is reverted.                            *
  ------------------------------------------------------
  12         6.917      6.840
  16         7.268      7.077
  24         7.536      7.345
  32        10.723      9.421
  48        14.220     11.350
  64        16.253     12.189
  96        14.494     12.507
  128       14.630     12.560
  192       15.402     12.967
  256       16.178     12.957
  384       15.177     13.346
  512       15.235     13.233

I had a quick look at smaps but didn't find any clues; any suggestions?
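For reference, below is a minimal sketch of the kind of smaps check I
did -- the 512M size mirrors the lat_mem_rd run, but the program itself
is only illustrative, not the lmbench code:

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  #define SZ       (512UL << 20)  /* 512M, as in the lat_mem_rd run */
  #define PMD_SIZE (2UL << 20)    /* 2M PMD with 4K base pages */

  int main(void)
  {
          char line[256];
          FILE *f;
          char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          /* Ask for THP in case the global policy is "madvise". */
          madvise(p, SZ, MADV_HUGEPAGE);
          memset(p, 1, SZ);       /* fault the pages in */

          printf("base %p, PMD-aligned: %s\n", p,
                 ((unsigned long)p & (PMD_SIZE - 1)) ? "no" : "yes");

          /* AnonHugePages > 0 for the range means it is THP-backed. */
          f = fopen("/proc/self/smaps", "r");
          if (!f)
                  return 1;
          while (fgets(line, sizeof(line), f))
                  if (strstr(line, "AnonHugePages"))
                          fputs(line, stdout);
          fclose(f);
          return 0;
  }

Thanks.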
On 2024/1/24 1:26, Ryan Roberts wrote:
> On 23/01/2024 17:14, Yang Shi wrote:
>> On Tue, Jan 23, 2024 at 1:41 AM Ryan Roberts wrote:
>>>
>>> On 22/01/2024 19:43, Yang Shi wrote:
>>>> On Mon, Jan 22, 2024 at 3:37 AM Ryan Roberts wrote:
>>>>>
>>>>> On 20/01/2024 16:39, Matthew Wilcox wrote:
>>>>>> On Sat, Jan 20, 2024 at 12:04:27PM +0000, Ryan Roberts wrote:
>>>>>>> However, after this patch, each allocation is in its own VMA, and there is a 2M
>>>>>>> gap between each VMA. This causes 2 problems: 1) mmap becomes MUCH slower
>>>>>>> because there are so many VMAs to check to find a new 1G gap. 2) It fails once
>>>>>>> it hits the VMA limit (/proc/sys/vm/max_map_count). Hitting this limit then
>>>>>>> causes a subsequent calloc() to fail, which causes the test to fail.
>>>>>>>
>>>>>>> Looking at the code, I think the problem is that arm64 selects
>>>>>>> ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT. But __thp_get_unmapped_area() allocates
>>>>>>> len+2M then always aligns to the bottom of the discovered gap. That causes the
>>>>>>> 2M hole. As far as I can see, x86 allocates bottom up, so you don't get a hole.
>>>>>>
>>>>>> As a quick hack, perhaps
>>>>>>
>>>>>> #ifdef ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
>>>>>> take-the-top-half
>>>>>> #else
>>>>>> current-take-bottom-half-code
>>>>>> #endif
>>>>>>
>>>>>> ?
>>>>
>>>> Thanks for the suggestion. It makes sense to me. Doing the alignment
>>>> needs to take this into account.
>>>>
>>>>>
>>>>> There is a general problem though that there is a trade-off between abutting
>>>>> VMAs, and aligning them to PMD boundaries. This patch has decided that in
>>>>> general the latter is preferable. The case I'm hitting is special though, in
>>>>> that both requirements could be achieved but currently are not.
>>>>>
>>>>> The below fixes it, but I feel like there should be some bitwise magic that
>>>>> would give the correct answer without the conditional - but my head is gone and
>>>>> I can't see it. Any thoughts?
>>>>
>>>> Thanks Ryan for the patch. TBH I didn't see a bitwise magic without
>>>> the conditional either.
>>>>
>>>>>
>>>>> Beyond this, though, there is also a latent bug where the offset provided to
>>>>> mmap() is carried all the way through to the get_unmapped_area()
>>>>> implementation, even for MAP_ANONYMOUS - I'm pretty sure we should be
>>>>> force-zeroing it for MAP_ANONYMOUS? Certainly before this change, for arches
>>>>> that use the default get_unmapped_area(), any non-zero offset would not have
>>>>> been used. But this change starts using it, which is incorrect. That said, there
>>>>> are some arches that override the default get_unmapped_area() and do use the
>>>>> offset. So I'm not sure if this is a bug or a feature that user space can pass
>>>>> an arbitrary value to the implementation for anon memory??
>>>>
>>>> Thanks for noticing this. If I read the code correctly, the pgoff is used
>>>> by some arches to work around VIPT caches, and it looks like it is for
>>>> shared mappings only (just checked arm and mips). And I believe
>>>> everybody assumes 0 should be used when doing anonymous mapping. The
>>>> offset should have nothing to do with seeking a proper unmapped virtual
>>>> area. But the pgoff does make sense for file THP due to the alignment
>>>> requirements. I think it should be zeroed for anonymous mappings,
>>>> like:
>>>>
>>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>>> index 2ff79b1d1564..a9ed353ce627 100644
>>>> --- a/mm/mmap.c
>>>> +++ b/mm/mmap.c
>>>> @@ -1830,6 +1830,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
>>>>                  pgoff = 0;
>>>>                  get_area = shmem_get_unmapped_area;
>>>>          } else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>>> +                pgoff = 0;
>>>>                  /* Ensures that larger anonymous mappings are THP aligned. */
>>>>                  get_area = thp_get_unmapped_area;
>>>>          }
>>>
>>> I think it would be cleaner to just zero pgoff if file==NULL, then it covers the
>>> shared case, the THP case, and the non-THP case properly. I'll prepare a
>>> separate patch for this.
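(To make the arithmetic above concrete: a small user-space sketch with
made-up addresses. For anonymous mappings off is 0, so the
"(off - ret) & (size - 1)" step rounds ret up to the next 2M boundary,
and the pad left over above the mapping is what shows up as the hole on
a top-down arch:)

  #include <stdio.h>

  int main(void)
  {
          unsigned long size    = 2UL << 20;        /* PMD size */
          unsigned long len     = 10UL << 20;       /* requested length */
          unsigned long len_pad = len + size;       /* what the kernel searches for */
          unsigned long off     = 0;                /* anon mapping: offset is 0 */
          unsigned long ret     = 0x7f0000100000UL; /* hypothetical gap base */
          unsigned long off_sub = (off - ret) & (size - 1);

          printf("gap base      0x%lx, len_pad %luM\n", ret, len_pad >> 20);
          printf("aligned addr  0x%lx (moved up %luK to align)\n",
                 ret + off_sub, off_sub >> 10);
          printf("unused pad above the mapping: %luK\n",
                 (len_pad - len - off_sub) >> 10);
          return 0;
  }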
>>
>> IIUC I don't think this is ok for those arches which have to
>> work around VIPT caches, since MAP_ANONYMOUS | MAP_SHARED with a NULL file
>> pointer is a common case for creating tmpfs mappings. For example,
>> arm's arch_get_unmapped_area() has:
>>
>>     if (aliasing)
>>             do_align = filp || (flags & MAP_SHARED);
>>
>> The pgoff is needed if do_align is true. So we should just zero pgoff
>> iff !file && !MAP_SHARED like what my patch does; we can move the
>> zeroing to a better place.
>
> We crossed streams - I sent out the patch just as you sent this. My patch is
> implemented as I proposed.
>
> I'm not sure I agree with what you are saying. The mmap man page says this:
>
>     The contents of a file mapping (as opposed to an anonymous mapping; see
>     MAP_ANONYMOUS below), are initialized using length bytes starting at offset
>     offset in the file (or other object) referred to by the file descriptor fd.
>
> So that implies offset is only relevant when a file is provided. It then goes on
> to say:
>
>     MAP_ANONYMOUS
>         The mapping is not backed by any file; its contents are initialized to
>         zero. The fd argument is ignored; however, some implementations require
>         fd to be -1 if MAP_ANONYMOUS (or MAP_ANON) is specified, and portable
>         applications should ensure this. The offset argument should be zero.
>
> So users are expected to pass offset=0 when mapping anon memory, for both the
> shared and private cases.
>
> In fact, in the line above where you made your proposed change, pgoff is also
> being zeroed for the (!file && (flags & MAP_SHARED)) case.
>
>
>>
>>>
>>>
>>>>
>>>>>
>>>>> Finally, the second test failure I reported (ksm_tests) is actually caused by a
>>>>> bug in the test code, but provoked by this change. So I'll send out a fix for
>>>>> the test code separately.
>>>>>
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 4f542444a91f..68ac54117c77 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -632,7 +632,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>>>>>  {
>>>>>          loff_t off_end = off + len;
>>>>>          loff_t off_align = round_up(off, size);
>>>>> -        unsigned long len_pad, ret;
>>>>> +        unsigned long len_pad, ret, off_sub;
>>>>>
>>>>>          if (off_end <= off_align || (off_end - off_align) < size)
>>>>>                  return 0;
>>>>> @@ -658,7 +658,13 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>>>>>          if (ret == addr)
>>>>>                  return addr;
>>>>>
>>>>> -        ret += (off - ret) & (size - 1);
>>>>> +        off_sub = (off - ret) & (size - 1);
>>>>> +
>>>>> +        if (current->mm->get_unmapped_area == arch_get_unmapped_area_topdown &&
>>>>> +            !off_sub)
>>>>> +                return ret + size;
>>>>> +
>>>>> +        ret += off_sub;
>>>>>          return ret;
>>>>>  }
>>>>
>>>> I didn't spot any problem, would you please come up with a formal patch?
>>>
>>> Yeah, I'll aim to post today.
>>
>> Thanks!
>>
>>>
>>>
>
>
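P.S. For completeness, a self-contained sketch of the VMA explosion
described at the top of the thread; the sizes and counts here are made
up, and whether the mappings merge depends on the alignment behaviour
being discussed:

  #include <stdio.h>
  #include <sys/mman.h>

  static unsigned long count_vmas(void)
  {
          char line[512];
          unsigned long n = 0;
          FILE *f = fopen("/proc/self/maps", "r");

          if (!f)
                  return 0;
          while (fgets(line, sizeof(line), f))
                  n++;
          fclose(f);
          return n;
  }

  int main(void)
  {
          unsigned long before = count_vmas();
          int i;

          for (i = 0; i < 1024; i++) {
                  void *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  if (p == MAP_FAILED) {
                          perror("mmap");
                          break;
                  }
          }
          /* With 2M-aligned but gapped mappings none of these merge, so
           * the VMA count grows by one per mmap() until it approaches
           * /proc/sys/vm/max_map_count. */
          printf("VMAs: %lu before, %lu after\n", before, count_vmas());
          return 0;
  }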