Message-ID: <171970f4-65e8-440c-8f7f-c2eb206e2bd8@arm.com>
Date: Tue, 7 Jan 2025 12:47:03 +0530
Subject: Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
To: Usama Arif, akpm@linux-foundation.org, david@redhat.com,
 willy@infradead.org, kirill.shutemov@linux.intel.com
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
 peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
 ziy@nvidia.com, jglisse@google.com, surenb@google.com,
 vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
 jhubbard@nvidia.com, 21cnbao@gmail.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Johannes Weiner
References: <20241216165105.56185-1-dev.jain@arm.com>
 <20241216165105.56185-8-dev.jain@arm.com>
 <56bf9df5-febf-4bef-966f-d4d71365a18d@gmail.com>
From: Dev Jain
In-Reply-To: <56bf9df5-febf-4bef-966f-d4d71365a18d@gmail.com>

On 06/01/25 3:34 pm, Usama Arif wrote:
>
> On 16/12/2024 16:51, Dev Jain wrote:
>> Scan
>> the PTEs order-wise, using the mask of suitable orders for this VMA
>> derived in conjunction with sysfs THP settings. Scale down the tunables; in
>> case of collapse failure, we drop down to the next order. Otherwise, we try to
>> jump to the highest possible order and then start a fresh scan. Note that
>> madvise(MADV_COLLAPSE) has not been generalized.
>>
>> Signed-off-by: Dev Jain
>> ---
>>  mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 69 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 886c76816963..078794aa3335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>  #include
>> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>  }
>>
>>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -			      int referenced, int unmapped,
>> +			      int referenced, int unmapped, int order,
>>  			      struct collapse_control *cc)
>>  {
>>  	LIST_HEAD(compound_pagelist);
>> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  				   unsigned long address, bool *mmap_locked,
>>  				   struct collapse_control *cc)
>>  {
>> -	pmd_t *pmd;
>> -	pte_t *pte, *_pte;
>> -	int result = SCAN_FAIL, referenced = 0;
>> -	int none_or_zero = 0, shared = 0;
>> -	struct page *page = NULL;
>> +	unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
>> +	int referenced, shared, none_or_zero, unmapped;
>> +	unsigned long _address, org_address = address;
>>  	struct folio *folio = NULL;
>> -	unsigned long _address;
>> -	spinlock_t *ptl;
>> -	int node = NUMA_NO_NODE, unmapped = 0;
>> +	struct page *page = NULL;
>> +	int node = NUMA_NO_NODE;
>> +	int result = SCAN_FAIL;
>>  	bool writable = false;
>> +	unsigned long orders;
>> +	pte_t *pte, *_pte;
>> +	spinlock_t *ptl;
>> +	pmd_t *pmd;
>> +	int order;
>>
>>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>
>> +
>> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
>> +	orders = thp_vma_suitable_orders(vma, address, orders);
>> +	order = highest_order(orders);
>> +
>> +	/* MADV_COLLAPSE needs to work irrespective of sysfs setting */
>> +	if (!cc->is_khugepaged)
>> +		order = HPAGE_PMD_ORDER;
>> +
>> +scan_pte_range:
>> +
>> +	max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> +	max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> +	max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
>> +	referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
>> +
> Hi Dev,
>
> Thanks for the patches.
>
> Looking at the above code, I imagine you are planning to use the
> max_ptes_none, max_ptes_shared and max_ptes_swap values that are used for
> PMD THPs for all mTHP sizes?
>
> I think this can be a bit confusing for users who aren't familiar with
> kernel code, as the default values are for PMD THPs, e.g. max_ptes_none is
> 511, and the user might not know that it is going to be scaled down for
> lower-order THPs.

That makes sense.

> Another thing is, what if these parameters have different optimal values
> than the scaled-down versions for mTHP?

By optimal, we mean how badly the sysadmin wants khugepaged to succeed. If I
want it to succeed so badly that I am ready to collapse for a single filled
entry, then that correspondence carries over to the scaled-down version.
There may be off-by-one errors but, well, they are off-by-one errors :)

> The other option is to introduce these parameters as new sysfs entries per
> mTHP size. These parameters can be very difficult to tune (and are usually
> left at their default values), so I don't think it's a good idea to
> introduce new sysfs parameters, but it is just something to think about.

Nonetheless, you raise a valid question, and I am not really sure how to go
about it.
If we are against new sysfs entries, the only remaining option is to scale
down, and then the only way the user will know this is happening is through
the kernel documentation.

> Regards,
> Usama
>
>> +	/* Check pmd after taking mmap lock */
>>  	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>  	if (result != SCAN_SUCCEED)
>>  		goto out;
>>
>>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>>  	nodes_clear(cc->alloc_nmask);
>> +
>>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>  	if (!pte) {
>>  		result = SCAN_PMD_NULL;
>>  		goto out;
>>  	}
>>
>> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +	for (_address = address, _pte = pte; _pte < pte + (1UL << order);
>>  	     _pte++, _address += PAGE_SIZE) {
>>  		pte_t pteval = ptep_get(_pte);
>>  		if (is_swap_pte(pteval)) {
>>  			++unmapped;
>>  			if (!cc->is_khugepaged ||
>> -			    unmapped <= khugepaged_max_ptes_swap) {
>> +			    unmapped <= max_ptes_swap) {
>>  				/*
>>  				 * Always be strict with uffd-wp
>>  				 * enabled swap entries. Please see
>> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  			++none_or_zero;
>>  			if (!userfaultfd_armed(vma) &&
>>  			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> +			     none_or_zero <= max_ptes_none)) {
>>  				continue;
>>  			} else {
>>  				result = SCAN_EXCEED_NONE_PTE;
>> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  		if (folio_likely_mapped_shared(folio)) {
>>  			++shared;
>>  			if (cc->is_khugepaged &&
>> -			    shared > khugepaged_max_ptes_shared) {
>> +			    shared > max_ptes_shared) {
>>  				result = SCAN_EXCEED_SHARED_PTE;
>>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>  				goto out_unmap;
>> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  		result = SCAN_PAGE_RO;
>>  	} else if (cc->is_khugepaged &&
>>  		   (!referenced ||
>> -		    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> +		    (unmapped && referenced < (1UL << order) / 2))) {
>>  		result = SCAN_LACK_REFERENCED_PAGE;
>>  	} else {
>>  		result = SCAN_SUCCEED;
>> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  	pte_unmap_unlock(pte, ptl);
>>  	if (result == SCAN_SUCCEED) {
>>  		result = collapse_huge_page(mm, address, referenced,
>> -					    unmapped, cc);
>> +					    unmapped, order, cc);
>>  		/* collapse_huge_page will return with the mmap_lock released */
>>  		*mmap_locked = false;
>> +
>> +		/* Immediately exit on exhaustion of range */
>> +		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> +			goto out;
>> +	}
>> +	if (result != SCAN_SUCCEED) {
>> +
>> +		/* Go to the next order. */
>> +		order = next_order(&orders, order);
>> +		if (order < 2)
>> +			goto out;
>> +		goto maybe_mmap_lock;
>> +	} else {
>> +		address = _address;
>> +		pte = _pte;
>> +
>> +		/* Get highest order possible starting from address */
>> +		order = count_trailing_zeros(address >> PAGE_SHIFT);
>> +
>> +		/* This needs to be present in the mask too */
>> +		if (!(orders & (1UL << order)))
>> +			order = next_order(&orders, order);
>> +		if (order < 2)
>> +			goto out;
>> +
>> +maybe_mmap_lock:
>> +		if (!(*mmap_locked)) {
>> +			mmap_read_lock(mm);
>> +			*mmap_locked = true;
>> +		}
>> +		goto scan_pte_range;
>>  	}
>>  out:
>>  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,