Message-ID: <5f17af68-721b-42bc-88f6-ee8fc527789d@linux.alibaba.com>
Date: Mon, 1 Apr 2024 17:47:09 +0800
From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: Re: [PATCH v2 2/2] mm: support multi-size THP numa balancing
To: Kefeng Wang, akpm@linux-foundation.org
Cc: david@redhat.com, mgorman@techsingularity.net, jhubbard@nvidia.com,
 ying.huang@intel.com, 21cnbao@gmail.com, ryan.roberts@arm.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <86236280-a811-4d5a-a480-5006226ae379@huawei.com>
References: <86236280-a811-4d5a-a480-5006226ae379@huawei.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2024/4/1 11:47, Kefeng Wang wrote:
> 
> 
> On 2024/3/29 14:56, Baolin Wang wrote:
>> Anonymous page allocation already supports multi-size THP (mTHP), but
>> NUMA balancing still prohibits mTHP migration even when it is an
>> exclusive mapping, which is unreasonable.
>>
>> Allow scanning mTHP:
>> Commit 859d4adc3415 ("mm: numa: do not trap faults on shared data
>> section pages") skips NUMA migration of shared CoW pages to avoid
>> migrating shared data segments. In addition, commit 80d47f5de5e3 ("mm:
>> don't try to NUMA-migrate COW pages that have other uses") changed to
>> use page_count() to avoid migrating GUP pages, which also causes mTHP
>> to be skipped during NUMA scanning. Theoretically, we can use
>> folio_maybe_dma_pinned() to detect the GUP issue; although there is
>> still a GUP race, the issue seems to have been resolved by commit
>> 80d47f5de5e3.
>> Meanwhile, use folio_likely_mapped_shared() to skip shared CoW pages,
>> even though it is not a precise sharer count. To check whether the
>> folio is shared, ideally we would want to make sure every page is
>> mapped to the same process, but doing that seems expensive; using the
>> estimated mapcount seems to work well enough when running the
>> autonuma benchmark.
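
The mm/mprotect.c hunk implementing this scanning-side check is not quoted
in this reply. As a rough sketch only, the check described above (skip
folios that may be GUP-pinned as well as shared CoW folios when installing
the NUMA hinting protection) could look like the fragment below; its exact
condition and its placement in the prot_numa path of change_pte_range()
are assumptions here, not the actual hunk:

        /*
         * Sketch, not the real hunk: skip folios that may be pinned by
         * GUP and shared CoW folios, as the changelog describes.
         */
        if (is_cow_mapping(vma->vm_flags) &&
            (folio_maybe_dma_pinned(folio) ||
             folio_likely_mapped_shared(folio)))
                continue;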
>>
>> Allow migrating mTHP:
>> As mentioned in the previous thread[1], large folios (including THP)
>> are more susceptible to false-sharing issues among threads than 4K
>> base pages, leading to pages ping-ponging back and forth during NUMA
>> balancing, which is currently not easy to resolve. Therefore, as a
>> start for mTHP NUMA balancing, we can follow the PMD-mapped THP
>> strategy: reuse the 2-stage filter in should_numa_migrate_memory() to
>> check whether the mTHP is being heavily contended among threads (by
>> checking the CPU id and pid of the last access) to avoid false
>> sharing to some degree. Thus, we restore all PTE maps of a large
>> folio upon its first hint page fault, again following the PMD-mapped
>> THP strategy. In the future, we can continue to optimize the NUMA
>> balancing algorithm to avoid the false-sharing issue with large
>> folios as much as possible.
>>
>> Performance data:
>> Machine environment: 2 nodes, 128 cores Intel(R) Xeon(R) Platinum
>> Base: 2024-03-25 mm-unstable branch
>> Enable mTHP to run autonuma-benchmark
>>
>> mTHP:16K
>>                       Base        Patched
>> numa01                224.70      143.48
>> numa01_THREAD_ALLOC   118.05       47.43
>> numa02                 13.45        9.29
>> numa02_SMT             14.80        7.50
>>
>> mTHP:64K
>>                       Base        Patched
>> numa01                216.15      114.40
>> numa01_THREAD_ALLOC   115.35       47.41
>> numa02                 13.24        9.25
>> numa02_SMT             14.67        7.34
>>
>> mTHP:128K
>>                       Base        Patched
>> numa01                205.13      144.45
>> numa01_THREAD_ALLOC   112.93       41.88
>> numa02                 13.16        9.18
>> numa02_SMT             14.81        7.49
>>
>> [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
>>
>> Signed-off-by: Baolin Wang
>> ---
>>   mm/memory.c   | 57 +++++++++++++++++++++++++++++++++++++++++++--------
>>   mm/mprotect.c |  3 ++-
>>   2 files changed, 51 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index c30fb4b95e15..2aca19e4fbd8 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -5068,16 +5068,56 @@ static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_str
>>       update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
>>   }
>>
>> +static void numa_rebuild_large_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
>> +                       struct folio *folio, pte_t fault_pte, bool ignore_writable)
>> +{
>> +    int nr = pte_pfn(fault_pte) - folio_pfn(folio);
>> +    unsigned long start = max(vmf->address - nr * PAGE_SIZE, vma->vm_start);
>> +    unsigned long end = min(vmf->address + (folio_nr_pages(folio) - nr) * PAGE_SIZE, vma->vm_end);
>> +    pte_t *start_ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
>> +    bool pte_write_upgrade = vma_wants_manual_pte_write_upgrade(vma);
>> +    unsigned long addr;
>> +
>> +    /* Restore all PTEs' mapping of the large folio */
>> +    for (addr = start; addr != end; start_ptep++, addr += PAGE_SIZE) {
>> +        pte_t pte, old_pte;
>> +        pte_t ptent = ptep_get(start_ptep);
>> +        bool writable = false;
>> +
>> +        if (!pte_present(ptent) || !pte_protnone(ptent))
>> +            continue;
>> +
>> +        if (pfn_folio(pte_pfn(ptent)) != folio)
>> +            continue;
>> +
>> +        if (!ignore_writable) {
>> +            ptent = pte_modify(ptent, vma->vm_page_prot);
>> +            writable = pte_write(ptent);
>> +            if (!writable && pte_write_upgrade &&
>> +                can_change_pte_writable(vma, addr, ptent))
>> +                writable = true;
>> +        }
>> +
>> +        old_pte = ptep_modify_prot_start(vma, addr, start_ptep);
>> +        pte = pte_modify(old_pte, vma->vm_page_prot);
>> +        pte = pte_mkyoung(pte);
>> +        if (writable)
>> +            pte = pte_mkwrite(pte, vma);
>> +        ptep_modify_prot_commit(vma, addr, start_ptep, old_pte, pte);
>> +        update_mmu_cache_range(vmf, vma, addr, start_ptep, 1);
> 
> Maybe pass "unsigned long address, pte_t *ptep" to
> numa_rebuild_single_mapping(), then, just call it here.

Yes, sounds reasonable. Will do in next version.
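
Roughly, the idea would be for numa_rebuild_single_mapping() to take the
address/ptep pair instead of hard-coding vmf->address and vmf->pte, along
these lines (an untested sketch, not the actual next version):

static void numa_rebuild_single_mapping(struct vm_fault *vmf, struct vm_area_struct *vma,
                                        unsigned long fault_addr, pte_t *fault_pte,
                                        bool writable)
{
        pte_t pte, old_pte;

        /*
         * Untested sketch: same restore sequence as the loop body above,
         * but applied to the caller-supplied PTE.
         */
        old_pte = ptep_modify_prot_start(vma, fault_addr, fault_pte);
        pte = pte_modify(old_pte, vma->vm_page_prot);
        pte = pte_mkyoung(pte);
        if (writable)
                pte = pte_mkwrite(pte, vma);
        ptep_modify_prot_commit(vma, fault_addr, fault_pte, old_pte, pte);
        update_mmu_cache_range(vmf, vma, fault_addr, fault_pte, 1);
}

The tail of the loop in numa_rebuild_large_mapping() would then collapse
into a single numa_rebuild_single_mapping(vmf, vma, addr, start_ptep,
writable) call.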
>>   static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   {
>>       struct vm_area_struct *vma = vmf->vma;
>>       struct folio *folio = NULL;
>>       int nid = NUMA_NO_NODE;
>> -    bool writable = false;
>> +    bool writable = false, ignore_writable = false;
>>       int last_cpupid;
>>       int target_nid;
>>       pte_t pte, old_pte;
>> -    int flags = 0;
>> +    int flags = 0, nr_pages;
>>
>>       /*
>>        * The pte cannot be used safely until we verify, while holding the page
>> @@ -5107,10 +5147,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>       if (!folio || folio_is_zone_device(folio))
>>           goto out_map;
>>
>> -    /* TODO: handle PTE-mapped THP */
>> -    if (folio_test_large(folio))
>> -        goto out_map;
>> -
>>       /*
>>        * Avoid grouping on RO pages in general. RO pages shouldn't hurt as
>>        * much anyway since they can be in shared cache state. This misses
>> @@ -5130,6 +5166,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>           flags |= TNF_SHARED;
>>
>>       nid = folio_nid(folio);
>> +    nr_pages = folio_nr_pages(folio);
>>       /*
>>        * For memory tiering mode, cpupid of slow memory page is used
>>        * to record page access time.  So use default value.
>> @@ -5146,6 +5183,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>       }
>>       pte_unmap_unlock(vmf->pte, vmf->ptl);
>>       writable = false;
>> +    ignore_writable = true;
>>
>>       /* Migrate to the requested node */
>>       if (migrate_misplaced_folio(folio, vma, target_nid)) {
>> @@ -5166,14 +5204,17 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
>>   out:
>>       if (nid != NUMA_NO_NODE)
>> -        task_numa_fault(last_cpupid, nid, 1, flags);
>> +        task_numa_fault(last_cpupid, nid, nr_pages, flags);
>>       return 0;
>>   out_map:
>>       /*
>>        * Make it present again, depending on how arch implements
>>        * non-accessible ptes, some can allow access by kernel mode.
>>        */
>> -    numa_rebuild_single_mapping(vmf, vma, writable);
>> +    if (folio && folio_test_large(folio))
> 
> initialize nr_pages and then call
> 
>     if (nr_pages > 1)

Umm, IMO, folio_test_large() is more readable for me.
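
The branch bodies are trimmed from the quote above; for reference, the
dispatch at out_map presumably ends up along the lines of the sketch below
(the else arm is inferred from the removed line, it is not quoted from the
patch):

        if (folio && folio_test_large(folio))
                /* restore all PTEs of the large folio on the first hint fault */
                numa_rebuild_large_mapping(vmf, vma, folio, pte, ignore_writable);
        else
                /* sketch: else arm assumed, trimmed from the quote */
                numa_rebuild_single_mapping(vmf, vma, writable);

Either spelling of the condition (folio_test_large(folio), or nr_pages > 1
with nr_pages initialized early as suggested) behaves the same here, so it
really is just a readability preference.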