From: "Huang, Ying"
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Gao Xiang, Yu Zhao,
 Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>,
 Chris Li
Subject: Re: [PATCH v4 0/6] Swap-out mTHP without splitting
In-Reply-To: <2fbc83bf-2e51-40fa-8865-499911ba8102@arm.com>
 (Ryan Roberts's message of "Tue, 12 Mar 2024 13:56:58 +0000")
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
 <878r2n516c.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <28914585-80bd-4308-b3aa-dd0dbb2cb201@arm.com>
 <2fbc83bf-2e51-40fa-8865-499911ba8102@arm.com>
Date: Wed, 13 Mar 2024 09:15:28 +0800
Message-ID: <87zfv32aq7.fsf@yhuang6-desk2.ccr.corp.intel.com>

Ryan Roberts writes:

> On 12/03/2024 08:49, Ryan Roberts wrote:
>> On 12/03/2024 08:01, Huang, Ying wrote:
>>> Ryan Roberts writes:
>>>
>>>> Hi All,
>>>>
>>>> This series adds support for swapping out multi-size THP (mTHP) without needing
>>>> to first split the large folio via split_huge_page_to_list_to_order(). It
>>>> closely follows the approach already used to swap-out PMD-sized THP.
>>>>
>>>> There are a couple of reasons for swapping out mTHP without splitting:
>>>>
>>>> - Performance: It is expensive to split a large folio and under extreme memory
>>>>   pressure some workloads regressed performance when using 64K mTHP vs 4K
>>>>   small folios because of this extra cost in the swap-out path. This series
>>>>   not only eliminates the regression but makes it faster to swap out 64K mTHP
>>>>   vs 4K small folios.
>>>>
>>>> - Memory fragmentation avoidance: If we can avoid splitting a large folio
>>>>   memory is less likely to become fragmented, making it easier to re-allocate
>>>>   a large folio in future.
>>>>
>>>> - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
>>>>   means we won't lose the TLB-efficiency benefits of mTHP once the memory has
>>>>   been through a swap cycle.
>>>>
>>>> I've done what I thought was the smallest change possible, and as a result, this
>>>> approach is only employed when the swap is backed by a non-rotating block device
>>>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>>>> that this is sufficient.
>>>>
>>>>
>>>> Performance Testing
>>>> ===================
>>>>
>>>> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
>>>> VM is set up with a 35G block ram device as the swap device and the test is run
>>>> from inside a memcg limited to 40G memory. I've then run `usemem` from
>>>> vm-scalability with 70 processes, each allocating and writing 1G of memory.
>>>> I've repeated everything 6 times and taken the mean performance improvement
>>>> relative to 4K page baseline:
>>>>
>>>> | alloc size |            baseline |       + this series |
>>>> |            |  v6.6-rc4+anonfolio |                     |
>>>> |:-----------|--------------------:|--------------------:|
>>>> | 4K Page    |                0.0% |                1.4% |
>>>> | 64K THP    |              -14.6% |               44.2% |
>>>> | 2M THP     |               87.4% |               97.7% |
>>>>
>>>> So with this change, the 64K swap performance goes from a 15% regression to a
>>>> 44% improvement. 4K and 2M swap improve slightly too.
>>>
>>> I don't understand why the performance of 2M THP improves.  The swap
>>> entry allocation becomes a little slower.  Can you provide some
>>> perf-profile to root cause it?
>>
>> I didn't post the stdev, which is quite large (~10%), so that may explain some
>> of it:
>>
>> | kernel   |   mean_rel |   std_rel |
>> |:---------|-----------:|----------:|
>> | base-4K  |       0.0% |      5.5% |
>> | base-64K |     -14.6% |      3.8% |
>> | base-2M  |      87.4% |     10.6% |
>> | v4-4K    |       1.4% |      3.7% |
>> | v4-64K   |      44.2% |     11.8% |
>> | v4-2M    |      97.7% |     13.3% |
>>
>> Regardless, I'll do some perf profiling and post results shortly.
>
> I did a lot more runs (24 for each config) and averaged them to try to remove the
> noise in the measurements. It's now only showing a 4% improvement for 2M. So I
> don't think the 2M improvement is real:
>
> | kernel   |   mean_rel |   std_rel |
> |:---------|-----------:|----------:|
> | base-4K  |       0.0% |      3.2% |
> | base-64K |      -9.1% |     10.1% |
> | base-2M  |      88.9% |      6.8% |
> | v4-4K    |       0.5% |      3.1% |
> | v4-64K   |      44.7% |      8.3% |
> | v4-2M    |      93.3% |      7.8% |
>
> Looking at the perf data, the only thing that sticks out is that a big chunk of
> time is spent in contpte_convert(), called as a result of
> try_to_unmap_one(). This is present in both the before and after configs.
>
> This is an arm64 function to "unfold" contpte mappings. Essentially, the PMD is
> being split during shrink_folio_list() with TTU_SPLIT_HUGE_PMD, meaning the
> THPs are PTE-mapped in contpte blocks.
> Then we are unmapping each pte one-by-one
> which means the contpte block needs to be unfolded. I think try_to_unmap_one()
> could potentially be optimized to batch unmap a contiguously mapped folio and
> avoid this unfold. But that would be an independent and separate piece of work.

Thanks for the additional data and detailed explanation.

--
Best Regards,
Huang, Ying
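
P.S. For anyone reading the quoted tables: the thread does not say exactly how
mean_rel and std_rel were computed, so the following is only a minimal sketch
under assumptions. It assumes each run yields a throughput figure (higher is
better), that every sample is normalised against the mean of the base-4K runs,
and that std_rel is the sample standard deviation of those normalised values.
The sample numbers and the relative_stats() helper are hypothetical, not data
from the test.

```python
import math
import statistics


def relative_stats(samples, baseline_mean):
    """Return (mean_rel, std_rel, sem_rel) for samples relative to a baseline.

    Assumption: each sample is a throughput value and "relative" means
    sample / baseline_mean - 1, so 0.0 is "same as baseline".
    """
    rel = [s / baseline_mean - 1.0 for s in samples]
    mean_rel = statistics.mean(rel)
    std_rel = statistics.stdev(rel)  # sample standard deviation (n - 1)
    # Standard error of the mean, assuming independent runs.
    sem_rel = std_rel / math.sqrt(len(rel))
    return mean_rel, std_rel, sem_rel


# Hypothetical per-run throughput figures, 6 runs per config.
runs = {
    "base-4K": [100.0, 104.0, 96.0, 101.0, 99.0, 100.0],
    "v4-4K": [102.0, 99.0, 103.0, 101.0, 100.0, 104.0],
}

baseline_mean = statistics.mean(runs["base-4K"])
for name, samples in runs.items():
    mean_rel, std_rel, sem_rel = relative_stats(samples, baseline_mean)
    print(f"{name:8s} mean_rel={mean_rel:+.1%} "
          f"std_rel={std_rel:.1%} sem={sem_rel:.1%}")
```

Under the independence assumption, the standard error of the mean scales as
1/sqrt(n), so going from 6 to 24 runs per config roughly halves it. That is
consistent with the 2M difference shrinking once more runs were averaged.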