From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: Baolin Wang
Cc: <21cnbao@gmail.com>
Subject: Re: [PATCH v2 2/2] mm: support multi-size THP numa balancing
Date: Mon, 1 Apr 2024 11:47:28 +0800
Message-ID: <86236280-a811-4d5a-a480-5006226ae379@huawei.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed

On 2024/3/29 14:56, Baolin Wang wrote:
> Now the anonymous page allocation already supports multi-size THP (mTHP),
> but NUMA balancing still prohibits mTHP migration even when the mapping is
> exclusive, which is unreasonable.
>
> Allow scanning mTHP:
> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data section
> pages") skips NUMA migration of shared CoW pages to avoid migrating shared
> data segments. In addition, commit 80d47f5de5e3 ("mm: don't try to
> NUMA-migrate COW pages that have other uses") changed the check to use
> page_count() to avoid migrating GUP pages, which also skips NUMA scanning
> of mTHP. Theoretically, we can use folio_maybe_dma_pinned() to detect the
> GUP case; although a GUP race remains, the issue seems to have been
> resolved by commit 80d47f5de5e3. Meanwhile, use folio_likely_mapped_shared()
> to skip shared CoW pages, even though it does not give a precise sharer
> count. To check whether a folio is shared, ideally we would want to make
> sure every page is mapped by the same process, but doing so seems expensive,
> and the estimated mapcount works well enough when running the autonuma
> benchmark.
>
> Allow migrating mTHP:
> As mentioned in the previous thread [1], large folios (including THP) are
> more susceptible to false sharing among threads than 4K base pages, leading
> to pages ping-ponging back and forth during NUMA balancing, which is
> currently not easy to resolve.
> Therefore, as a start to support mTHP NUMA balancing, we can follow the
> PMD-mapped THP strategy: reuse the 2-stage filter in
> should_numa_migrate_memory() to check whether the mTHP is being heavily
> contended among threads (by checking the CPU id and pid of the last access)
> and so avoid false sharing to some degree. Thus, we restore all PTE mappings
> upon the first hint page fault on a large folio, just as the PMD-mapped THP
> strategy does. In the future, we can continue to optimize the NUMA balancing
> algorithm to avoid the false-sharing issue with large folios as much as
> possible.
>
> Performance data:
> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
> Base: 2024-03-25 mm-unstable branch
> Enable mTHP to run autonuma-benchmark
>
> mTHP:16K
>                        Base     Patched
> numa01                224.70    143.48
> numa01_THREAD_ALLOC   118.05     47.43
> numa02                 13.45      9.29
> numa02_SMT             14.80      7.50
>
> mTHP:64K
>                        Base     Patched
> numa01                216.15    114.40
> numa01_THREAD_ALLOC   115.35     47.41
> numa02                 13.24      9.25
> numa02_SMT             14.67      7.34
>
> mTHP:128K
>                        Base     Patched
> numa01                205.13    144.45
> numa01_THREAD_ALLOC   112.93     41.88
> numa02                 13.16      9.18
> numa02_SMT             14.81      7.49
>
> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
>
> Signed-off-by: Baolin Wang
> ---
>  mm/memory.c   | 57 +++++++++++++++++++++++++++++++++++++++++++--------
>  mm/mprotect.c |  3 ++-
>  2 files changed, 51 insertions(+), 9 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c30fb4b95e15..2aca19e4fbd8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5068,16 +5068,56 @@ static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_str
>          update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>  }
>
> +static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
> +                                      struct folio *folio, pte_t fault_pte, bool ignore_writable)
> +{
> +        int nr = pte_pfn(fault_pte) - folio_pfn(folio);
> +        unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
> +        unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end);
> +        pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
> +        bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
> +        unsigned long addr;
> +
> +        /* Restore all PTEs' mapping of the large folio */
> +        for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) {
> +                pte_t pte, old_pte;
> +                pte_t ptent = ptep_get(start_ptep);
> +                bool writable = false;
> +
> +                if (!pte_present(ptent) || !pte_protnone(ptent))
> +                        continue;
> +
> +                if (pfn_folio(pte_pfn(ptent)) != folio)
> +                        continue;
> +
> +                if (!ignore_writable) {
> +                        ptent = pte_modify(ptent, vma->vm_page_prot);
> +                        writable = pte_write(ptent);
> +                        if (!writable && pte_write_upgrade &&
> +                            can_change_pte_writable(vma, addr, ptent))
> +                                writable = true;
> +                }
> +
> +                old_pte = ptep_modify_prot_start(vma, addr, start_ptep);
> +                pte = pte_modify(old_pte, vma->vm_page_prot);
> +                pte = pte_mkyoung(pte);
> +                if (writable)
> +                        pte = pte_mkwrite(pte, vma);
> +                ptep_modify_prot_commit(vma, addr, start_ptep, old_pte, pte);
> +                update_mmu_cache_range(vmf, vma, addr, start_ptep, 1);

Maybe pass "unsigned long address, pte_t *ptep" to numa_rebuild_single_mapping(),
then just call it here.
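
Something along these lines is what I had in mind (only a rough, untested
sketch to illustrate the idea; the new prototype and argument names are just
a suggestion, and the body is simply the tail of your loop above):

static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
                                        unsigned long address, pte_t *ptep,
                                        bool writable)
{
        pte_t pte, old_pte;

        /* Rebuild one present PTE at 'address', keeping the caller's
         * writable decision, and flush the MMU cache for that entry. */
        old_pte = ptep_modify_prot_start(vma, address, ptep);
        pte = pte_modify(old_pte, vma->vm_page_prot);
        pte = pte_mkyoung(pte);
        if (writable)
                pte = pte_mkwrite(pte, vma);
        ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
        update_mmu_cache_range(vmf, vma, address, ptep, 1);
}

Then the end of the loop body above (and the existing caller, which would
pass vmf->address/vmf->pte) collapses to a single call:

                numa_rebuild_single_mapping(vmf, vma, addr, start_ptep, writable);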
> +        }
> +}
> +
>  static vm_fault_t do_numa_page(struct vm_fault *vmf)
>  {
>          struct vm_area_struct *vma = vmf->vma;
>          struct folio *folio = NULL;
>          int nid = NUMA_NO_NODE;
> -        bool writable = false;
> +        bool writable = false, ignore_writable = false;
>          int last_cpupid;
>          int target_nid;
>          pte_t pte, old_pte;
> -        int flags = 0;
> +        int flags = 0, nr_pages;
>
>          /*
>           * The pte cannot be used safely until we verify, while holding the page
> @@ -5107,10 +5147,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>          if (!folio || folio_is_zone_device(folio))
>                  goto out_map;
>
> -        /* TODO: handle PTE-mapped THP */
> -        if (folio_test_large(folio))
> -                goto out_map;
> -
>          /*
>           * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
>           * much anyway since they can be in shared cache state. This misses
> @@ -5130,6 +5166,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>                  flags |= TNF_SHARED;
>
>          nid = folio_nid(folio);
> +        nr_pages = folio_nr_pages(folio);
>          /*
>           * For memory tiering mode, cpupid of slow memory page is used
>           * to record page access time. So use default value.
> @@ -5146,6 +5183,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>          }
>          pte_unmap_unlock(vmf->pte, vmf->ptl);
>          writable = false;
> +        ignore_writable = true;
>
>          /* Migrate to the requested node */
>          if (migrate_misplaced_folio(folio, vma, target_nid)) {
> @@ -5166,14 +5204,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>
>  out:
>          if (nid != NUMA_NO_NODE)
> -                task_numa_fault(last_cpupid, nid, 1, flags);
> +                task_numa_fault(last_cpupid, nid, nr_pages, flags);
>          return 0;
>  out_map:
>          /*
>           * Make it present again, depending on how arch implements
>           * non-accessible ptes, some can allow access by kernel mode.
>           */
> -        numa_rebuild_single_mapping(vmf, vma, writable);
> +        if (folio && folio_test_large(folio))

Initialize nr_pages earlier, and then this check could simply be
"if (nr_pages > 1)" (see the sketch after the quoted hunks below).

> +                numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable);
> +        else
> +                numa_rebuild_single_mapping(vmf, vma, writable);
>          pte_unmap_unlock(vmf->pte, vmf->ptl);
>          goto out;
>  }
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index f8a4544b4601..94878c39ee32 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -129,7 +129,8 @@ static long change_pte_range(struct mmu_gather *tlb,
>
>                          /* Also skip shared copy-on-write pages */
>                          if (is_cow_mapping(vma->vm_flags) &&
> -                            folio_ref_count(folio) != 1)
> +                            (folio_maybe_dma_pinned(folio) ||
> +                             folio_likely_mapped_shared(folio)))
>                                  continue;
>
>                          /*
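
To spell out the nr_pages suggestion above (again only a rough, untested
sketch of the relevant do_numa_page() fragments, not a tested change; the
default value and its placement are my assumption):

        struct folio *folio = NULL;
        int nr_pages = 1;       /* covers the !folio and small-folio cases */
        ...
        folio = vm_normal_folio(vma, vmf->address, pte);
        if (!folio || folio_is_zone_device(folio))
                goto out_map;

        /* Set once the folio is known, so out_map can test it directly. */
        nr_pages = folio_nr_pages(folio);
        ...
out_map:
        /*
         * Make it present again, depending on how arch implements
         * non-accessible ptes, some can allow access by kernel mode.
         */
        if (nr_pages > 1)
                numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable);
        else
                numa_rebuild_single_mapping(vmf, vma, writable);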