From: "Huang, Ying" <ying.huang@intel.com>
To: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Andrea Arcangeli, Andrew Morton, Kirill A. Shutemov, Nadav Amit,
    Hugh Dickins, David Hildenbrand, Vlastimil Babka, Andi Kleen
Subject: Re: [PATCH RFC 0/4] mm: Remember young bit for migration entries
References: <20220729014041.21292-1-peterx@redhat.com>
Date: Mon, 01 Aug 2022 13:33:28 +0800
In-Reply-To: <20220729014041.21292-1-peterx@redhat.com> (Peter Xu's message of
 "Thu, 28 Jul 2022 21:40:37 -0400")
Message-ID: <87pmhkjzo7.fsf@yhuang6-desk2.ccr.corp.intel.com>
Peter Xu <peterx@redhat.com> writes:

> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> Problem
> =======
>
> When migrating a page, right now we always mark the migrated page as
> old.  The reason could be that we don't really know whether the page is
> hot or cold, so we have taken the default negative assumption as the
> safer one.
>
> However, that can lead to at least two problems:
>
> (1) We lose the real hot/cold information that we could have persisted.
>     That information shouldn't change even if the backing page changes
>     after the migration.
>
> (2) There is always extra overhead on the immediate next access to any
>     migrated page, because the hardware MMU needs cycles to set the
>     young bit again (as long as the MMU supports it).
>
> Many of the recent upstream works have shown that (2) is not trivial
> and is actually very measurable.
> In my test case, reading a 1G chunk of memory - jumping in page-size
> intervals - could take 99ms just because of the extra work of setting
> the young bit, on a generic x86_64 system, compared to 4ms if the young
> bit is already set.
>
> This issue was originally reported by Andrea Arcangeli.
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young bit in
> the migration entries and carry it over when recovering the ptes.
>
> We have the chance to do so because on many systems the swap offset is
> not fully used.  Migration entries use the swp offset to store only the
> PFN, and the PFN is normally smaller than the maximum swp offset.  That
> means we have some free bits in the swp offset that we can use to store
> things like the young bit, and that is how this series approaches the
> problem.
>
> One tricky thing here is that even though we're embedding the
> information into the swap entry, which seems to be a very generic data
> structure, the number of free bits is still arch dependent.  Not only
> because the size of swp_entry_t differs, but also due to the different
> layouts of swap ptes on different archs.

If my understanding is correct, max_swapfile_size() provides a mechanism
to identify the available bits with both swp_entry_t and the swap PTE
considered.  Could we take advantage of that?

And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
file size to MAX_PA/2"), the highest bit of the swap offset needs to be
0 if the L1TF mitigation is enabled.  Cc'ed Andi for confirmation.

Best Regards,
Huang, Ying

> Here, this series requires a specific arch to define an extra macro
> called __ARCH_SWP_OFFSET_BITS, which represents the size of the swp
> offset.  With this information, the swap logic knows whether there are
> extra bits to use, and then it will remember the young bit when
> possible.  By default, it keeps the old behavior of leaving all
> migrated pages cold.
>
> Tests
> =====
>
> After the patchset is applied, the immediate read access test [1] of the
> above 1G chunk after migration shrinks from 99ms to 4ms.  The test is
> done by moving 1G of pages from node 0->1->0, then reading them in
> page-size jumps.
>
> Currently __ARCH_SWP_OFFSET_BITS is only defined on x86 for this series,
> and it is only tested on x86_64 with an Intel(R) Xeon(R) CPU E5-2630 v4
> @ 2.20GHz.
>
> Patch Layout
> ============
>
> Patch 1: Add swp_offset_pfn() and apply it to all pfn swap entries; we
>          should also stop treating swp_offset() as a PFN, because it can
>          contain more information starting from the next patch.
> Patch 2: The core patch to remember the young bit in swap offsets.
> Patch 3: A cleanup for the x86 32-bit pgtable.h.
> Patch 4: Define __ARCH_SWP_OFFSET_BITS on x86, enabling the young bit
>          for migration entries.
>
> Please review, thanks.
>
> [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
>
> Peter Xu (4):
>   mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
>   mm: Remember young bit for page migrations
>   mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
>   mm/x86: Define __ARCH_SWP_OFFSET_BITS
>
>  arch/arm64/mm/hugetlbpage.c           |  2 +-
>  arch/x86/include/asm/pgtable-2level.h |  6 ++
>  arch/x86/include/asm/pgtable-3level.h | 15 +++--
>  arch/x86/include/asm/pgtable_64.h     |  5 ++
>  include/linux/swapops.h               | 85 +++++++++++++++++++++++++--
>  mm/hmm.c                              |  2 +-
>  mm/huge_memory.c                      | 10 +++-
>  mm/memory-failure.c                   |  2 +-
>  mm/migrate.c                          |  4 +-
>  mm/migrate_device.c                   |  2 +
>  mm/page_vma_mapped.c                  |  6 +-
>  mm/rmap.c                             |  3 +-
>  12 files changed, 122 insertions(+), 20 deletions(-)