From: "Huang, Ying" <ying.huang@intel.com>
To: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, <21cnbao@gmail.com>, ...
Subject: Re: [RFC PATCH v2] mm: support multi-size THP numa balancing
In-Reply-To: <903bf13fc3e68b8dc1f256570d78b55b2dd9c96f.1710493587.git.baolin.wang@linux.alibaba.com>
	(Baolin Wang's message of "Fri, 15 Mar 2024 17:18:14 +0800")
References: <903bf13fc3e68b8dc1f256570d78b55b2dd9c96f.1710493587.git.baolin.wang@linux.alibaba.com>
Date: Mon, 18 Mar 2024 14:16:59 +0800
Message-ID: <871q88vzc4.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii

Baolin Wang <baolin.wang@linux.alibaba.com> writes:

> Now the anonymous page allocation already supports multi-size THP
> (mTHP), but numa balancing still prohibits mTHP migration even for an
> exclusive mapping, which is unreasonable. Thus let's support numa
> balancing for exclusive mTHP first.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
> section pages") skips NUMA migration of shared CoW pages to avoid
> migrating shared data segments. In addition, commit 80d47f5de5e3 ("mm:
> don't try to NUMA-migrate COW pages that have other uses") changed the
> check to page_count() to avoid migrating GUP pages, which also skips
> mTHP numa scanning. Theoretically, we can use folio_maybe_dma_pinned()
> to detect the GUP issue; although there is still a GUP race, that
> issue seems to have been resolved by commit 80d47f5de5e3. Meanwhile,
> use folio_estimated_sharers() to skip shared CoW pages, though it is
> not a precise sharer count. To check whether a folio is shared, we
> would ideally want to make sure every page of it is mapped by the same
> process, but doing that seems expensive; using the estimated mapcount
> appears to work when running the autonuma benchmark.
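For context, folio_estimated_sharers() only samples the mapcount of the
folio's first page, roughly the following paraphrase of the helper in
include/linux/mm.h, which is why it is only an estimate and can misjudge
sharing when only some pages of a large folio are mapped more than once:

static inline int folio_estimated_sharers(struct folio *folio)
{
	/* Only the mapcount of the first (head) page is consulted. */
	return page_mapcount(folio_page(folio, 0));
}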
>
> Allow migrating mTHP:
> As mentioned in the previous thread[1], large folios are more
> susceptible to false sharing issues, leading to pages ping-ponging
> back and forth during numa balancing, which is currently hard to
> resolve. Therefore, as a start for mTHP numa balancing, only exclusive
> mappings are allowed to perform numa migration, to avoid the false
> sharing issues with large folios. Similarly, use the estimated
> mapcount to skip shared mappings, which seems to work in most cases
> (?); we have already used folio_estimated_sharers() to skip shared
> mappings in migrate_misplaced_folio() for numa balancing, with no real
> complaints so far.

IIUC, folio_estimated_sharers() cannot identify multi-threaded
applications. If some mTHP is shared by multiple threads in one
process, how do we deal with that? For example, I think that we should
avoid migrating on the first fault for mTHP in
should_numa_migrate_memory().

More thoughts? Can we add a field in struct folio for mTHP to count
hint page faults from the same node?
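Something like the completely untested sketch below is what I have in
mind. Note that _numa_fault_nid and _numa_fault_count are invented
names, not existing struct folio fields, and locking is hand-waved:

/*
 * Hypothetical helper: defer mTHP migration until repeated NUMA hint
 * faults have come from the same node, so a large folio is not
 * migrated on its very first hint fault.
 */
static bool mthp_numa_migrate_allowed(struct folio *folio, int fault_nid)
{
	/* Small folios keep the existing behavior. */
	if (!folio_test_large(folio))
		return true;

	if (folio->_numa_fault_nid != fault_nid) {
		/* First fault, or the faulting node changed: restart. */
		folio->_numa_fault_nid = fault_nid;
		folio->_numa_fault_count = 1;
		return false;
	}

	/* Migrate only after repeated hint faults from the same node. */
	return ++folio->_numa_fault_count >= 2;
}

should_numa_migrate_memory() could then consult such a helper before
the existing cpupid heuristics.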
--
Best Regards,
Huang, Ying

> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-3-15 mm-unstable branch
> Enable mTHP=64K to run autonuma-benchmark
>
> Base without the patch:
> numa01
> 222.97
> numa01_THREAD_ALLOC
> 115.78
> numa02
> 13.04
> numa02_SMT
> 14.69
>
> Base with the patch:
> numa01
> 125.36
> numa01_THREAD_ALLOC
> 44.58
> numa02
> 9.22
> numa02_SMT
> 7.46
>
> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> Changes from RFC v1:
>  - Add some performance data per Huang, Ying.
>  - Allow mTHP scanning per David Hildenbrand.
>  - Avoid shared mappings for numa balancing to avoid false sharing.
>  - Expand the commit message.
> ---
>  mm/memory.c   | 9 +++++----
>  mm/mprotect.c | 3 ++-
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2bc6dd15eb8..b9d5d88c5a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5059,7 +5059,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	int last_cpupid;
>  	int target_nid;
>  	pte_t pte, old_pte;
> -	int flags = 0;
> +	int flags = 0, nr_pages = 0;
>
>  	/*
>  	 * The pte cannot be used safely until we verify, while holding the page
> @@ -5089,8 +5089,8 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  	if (!folio || folio_is_zone_device(folio))
>  		goto out_map;
>
> -	/* TODO: handle PTE-mapped THP */
> -	if (folio_test_large(folio))
> +	/* Avoid large folio false sharing */
> +	if (folio_test_large(folio) && folio_estimated_sharers(folio) > 1)
>  		goto out_map;
>
>  	/*
> @@ -5112,6 +5112,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  		flags |= TNF_SHARED;
>
>  	nid = folio_nid(folio);
> +	nr_pages = folio_nr_pages(folio);
>  	/*
>  	 * For memory tiering mode, cpupid of slow memory page is used
>  	 * to record page access time. So use default value.
> @@ -5148,7 +5149,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>
>  out:
>  	if (nid != NUMA_NO_NODE)
> -		task_numa_fault(last_cpupid, nid, 1, flags);
> +		task_numa_fault(last_cpupid, nid, nr_pages, flags);
>  	return 0;
>  out_map:
>  	/*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..f0b9c974aaae 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>  		/* Also skip shared copy-on-write pages */
>  		if (is_cow_mapping(vma->vm_flags) &&
> -		    folio_ref_count(folio) != 1)
> +		    (folio_maybe_dma_pinned(folio) ||
> +		     folio_estimated_sharers(folio) > 1))
>  			continue;
>
>  		/*