Subject: Re: [RFC PATCH 07/12] khugepaged: Scan PTEs order-wise
From: Dev Jain <dev.jain@arm.com>
Date: Wed, 18 Dec 2024 14:54:15 +0530
To: Ryan Roberts, akpm@linux-foundation.org, david@redhat.com,
 willy@infradead.org, kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
 jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
 hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241216165105.56185-1-dev.jain@arm.com>
 <20241216165105.56185-8-dev.jain@arm.com>

On 17/12/24 11:45 pm, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> Scan the PTEs order-wise, using the mask of suitable orders for this VMA
>> derived in conjunction with sysfs THP settings. Scale down the tunables; in
>> case of collapse failure, we drop down to the next order. Otherwise, we try
>> to jump to the highest possible order and then start a fresh scan. Note that
>> madvise(MADV_COLLAPSE) has not been generalized.
> Is there a reason you are not modifying MADV_COLLAPSE? It's really just a
> synchronous way to do what khugepaged does asynchronously (isn't it?), so it
> would behave the same way in an ideal world.

Correct, but I started running into return value problems for madvise(). For
example, the return value of hpage_collapse_scan_ptes() will be the return
value of the last mTHP scan. In that case, what do we want madvise() to
return? If I collapse the range into multiple 64K mTHPs, then I should still
return failure, because otherwise the caller would logically assume that
MADV_COLLAPSE succeeded and therefore that a PMD hugepage is mapped there. But
then the caller has still ended up collapsing memory... and if we do return
success, the khugepaged selftest starts failing. Basically, this would be
(kind of?) an ABI change, and I really didn't want to sway the discussion away
from khugepaged, so I just kept it simple :)
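
To make the ambiguity concrete, here is a minimal userspace sketch of the kind
of caller I am worried about. It is purely illustrative (the helper name is
made up), and it assumes headers new enough to define MADV_COLLAPSE:

	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* value from the kernel uapi headers */
	#endif

	/*
	 * Illustrative caller: it treats a 0 return from MADV_COLLAPSE as
	 * "the range is now backed by PMD hugepages". If the kernel collapsed
	 * the range into several smaller mTHPs and still returned 0, that
	 * assumption silently becomes wrong; if it returns an error instead,
	 * the caller cannot tell that memory was collapsed anyway.
	 */
	int collapse_and_assume_pmd(void *addr, size_t len)
	{
		if (madvise(addr, len, MADV_COLLAPSE)) {
			perror("madvise(MADV_COLLAPSE)");
			return -1;
		}
		/* Caller now assumes PMD mappings back [addr, addr + len). */
		return 0;
	}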
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>  mm/khugepaged.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>>  1 file changed, 69 insertions(+), 15 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 886c76816963..078794aa3335 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>  #include
>> @@ -1111,7 +1112,7 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>  }
>>
>>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -			      int referenced, int unmapped,
>> +			      int referenced, int unmapped, int order,
>>  			      struct collapse_control *cc)
>>  {
>>  	LIST_HEAD(compound_pagelist);
>> @@ -1278,38 +1279,59 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  				   unsigned long address, bool *mmap_locked,
>>  				   struct collapse_control *cc)
>>  {
>> -	pmd_t *pmd;
>> -	pte_t *pte, *_pte;
>> -	int result = SCAN_FAIL, referenced = 0;
>> -	int none_or_zero = 0, shared = 0;
>> -	struct page *page = NULL;
>> +	unsigned int max_ptes_shared, max_ptes_none, max_ptes_swap;
>> +	int referenced, shared, none_or_zero, unmapped;
>> +	unsigned long _address, org_address = address;
> nit: Perhaps it's clearer to keep the original address in address and use a
> variable, start, for the starting point of each scan?

Probably... will keep it in mind.

>>  	struct folio *folio = NULL;
>> -	unsigned long _address;
>> -	spinlock_t *ptl;
>> -	int node = NUMA_NO_NODE, unmapped = 0;
>> +	struct page *page = NULL;
>> +	int node = NUMA_NO_NODE;
>> +	int result = SCAN_FAIL;
>>  	bool writable = false;
>> +	unsigned long orders;
>> +	pte_t *pte, *_pte;
>> +	spinlock_t *ptl;
>> +	pmd_t *pmd;
>> +	int order;
>>
>>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>>
>> +	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
>> +			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER + 1) - 1);
> Perhaps THP_ORDERS_ALL instead of "BIT(PMD_ORDER + 1) - 1"?

Ah yes, THP_ORDERS_ALL_ANON.
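
As an aside for anyone following along: "orders" is just a bitmap with bit N
set when order N is allowed, so the argument above selects every order up to
and including PMD order. The tiny userspace illustration below shows what the
two masks evaluate to for 4K base pages (PMD_ORDER == 9); the detail that
THP_ORDERS_ALL_ANON additionally clears orders 0 and 1 is from my reading of
huge_mm.h, so treat it as an assumption:

	#include <stdio.h>

	int main(void)
	{
		const unsigned int pmd_order = 9;	/* 4K base pages */
		unsigned long all = (1UL << (pmd_order + 1)) - 1;	/* orders 0..9 */
		unsigned long anon = all & ~((1UL << 0) | (1UL << 1));	/* orders 2..9 */

		printf("BIT(PMD_ORDER + 1) - 1 = %#lx\n", all);	/* 0x3ff */
		printf("anon-capable orders    = %#lx\n", anon);	/* 0x3fc */
		return 0;
	}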
>
>> +	orders = thp_vma_suitable_orders(vma, address, orders);
>> +	order = highest_order(orders);
>> +
>> +	/* MADV_COLLAPSE needs to work irrespective of sysfs setting */
>> +	if (!cc->is_khugepaged)
>> +		order = HPAGE_PMD_ORDER;
>> +
>> +scan_pte_range:
>> +
>> +	max_ptes_shared = khugepaged_max_ptes_shared >> (HPAGE_PMD_ORDER - order);
>> +	max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>> +	max_ptes_swap = khugepaged_max_ptes_swap >> (HPAGE_PMD_ORDER - order);
>> +	referenced = 0, shared = 0, none_or_zero = 0, unmapped = 0;
>> +
>> +	/* Check pmd after taking mmap lock */
>>  	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>>  	if (result != SCAN_SUCCEED)
>>  		goto out;
>>
>>  	memset(cc->node_load, 0, sizeof(cc->node_load));
>>  	nodes_clear(cc->alloc_nmask);
>> +
>>  	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
>>  	if (!pte) {
>>  		result = SCAN_PMD_NULL;
>>  		goto out;
>>  	}
>>
>> -	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>> +	for (_address = address, _pte = pte; _pte < pte + (1UL << order);
>>  	     _pte++, _address += PAGE_SIZE) {
>>  		pte_t pteval = ptep_get(_pte);
>>  		if (is_swap_pte(pteval)) {
>>  			++unmapped;
>>  			if (!cc->is_khugepaged ||
>> -			    unmapped <= khugepaged_max_ptes_swap) {
>> +			    unmapped <= max_ptes_swap) {
>>  				/*
>>  				 * Always be strict with uffd-wp
>>  				 * enabled swap entries. Please see
>> @@ -1330,7 +1352,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  			++none_or_zero;
>>  			if (!userfaultfd_armed(vma) &&
>>  			    (!cc->is_khugepaged ||
>> -			     none_or_zero <= khugepaged_max_ptes_none)) {
>> +			     none_or_zero <= max_ptes_none)) {
>>  				continue;
>>  			} else {
>>  				result = SCAN_EXCEED_NONE_PTE;
>> @@ -1375,7 +1397,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  		if (folio_likely_mapped_shared(folio)) {
>>  			++shared;
>>  			if (cc->is_khugepaged &&
>> -			    shared > khugepaged_max_ptes_shared) {
>> +			    shared > max_ptes_shared) {
>>  				result = SCAN_EXCEED_SHARED_PTE;
>>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>  				goto out_unmap;
>> @@ -1432,7 +1454,7 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  			result = SCAN_PAGE_RO;
>>  		} else if (cc->is_khugepaged &&
>>  			   (!referenced ||
>> -			    (unmapped && referenced < HPAGE_PMD_NR / 2))) {
>> +			    (unmapped && referenced < (1UL << order) / 2))) {
>>  			result = SCAN_LACK_REFERENCED_PAGE;
>>  		} else {
>>  			result = SCAN_SUCCEED;
>> @@ -1441,9 +1463,41 @@ static int hpage_collapse_scan_ptes(struct mm_struct *mm,
>>  	pte_unmap_unlock(pte, ptl);
>>  	if (result == SCAN_SUCCEED) {
>>  		result = collapse_huge_page(mm, address, referenced,
>> -					    unmapped, cc);
>> +					    unmapped, order, cc);
>>  		/* collapse_huge_page will return with the mmap_lock released */
>>  		*mmap_locked = false;
>> +
>> +		/* Immediately exit on exhaustion of range */
>> +		if (_address == org_address + (PAGE_SIZE << HPAGE_PMD_ORDER))
>> +			goto out;
> Looks like this assumes this function is always asked to scan a full PTE table?
> Does that mean that you can't handle collapse for VMAs that don't span a whole
> PMD entry? I think we will want to support that.

Correct. Yes, we should support that, otherwise khugepaged will scan only large
VMAs. Will have to make a change in khugepaged_scan_mm_slot() for anon case.
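
(Coming back to the tunable scaling at the top of the new scan_pte_range block:
each sysfs tunable is just shifted down by the order delta, so the threshold
stays roughly proportional to the number of PTEs covered by that order. A quick
userspace illustration, assuming 4K base pages, HPAGE_PMD_ORDER == 9, and what
I believe are the default tunable values of 511/64/256 for
max_ptes_none/swap/shared:)

	#include <stdio.h>

	int main(void)
	{
		const unsigned int pmd_order = 9;
		const unsigned int none = 511, swap = 64, shared = 256;

		/* Same shift as in the patch: tunable >> (HPAGE_PMD_ORDER - order) */
		for (unsigned int order = 2; order <= pmd_order; order++)
			printf("order %u: none=%u swap=%u shared=%u\n", order,
			       none >> (pmd_order - order),
			       swap >> (pmd_order - order),
			       shared >> (pmd_order - order));
		return 0;
	}

At order 4 (a 64K mTHP, 16 PTEs) this prints 15/2/8, i.e. the PMD-order
defaults scaled down by 32.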
>
>> +	}
>> +	if (result != SCAN_SUCCEED) {
>> +
>> +		/* Go to the next order. */
>> +		order = next_order(&orders, order);
>> +		if (order < 2)
> This should be:
> 	if (!orders)
>
> I think the return order is undefined when order is the last order in orders.

The return order is -1, from what I could gather from reading the code.

>
>> +			goto out;
>> +		goto maybe_mmap_lock;
>> +	} else {
>> +		address = _address;
>> +		pte = _pte;
>> +
>> +
>> +		/* Get highest order possible starting from address */
>> +		order = count_trailing_zeros(address >> PAGE_SHIFT);
>> +
>> +		/* This needs to be present in the mask too */
>> +		if (!(orders & (1UL << order)))
>> +			order = next_order(&orders, order);
> Not quite; if the exact order isn't in the bitmap, this will pick out the
> highest order in the bitmap, which may be higher than count_trailing_zeros()
> returned.

Oh okay, nice catch.

> You could do:
>
> 	order = count_trailing_zeros(address >> PAGE_SHIFT);
> 	orders &= (1UL << order + 1) - 1;
> 	order = next_order(&orders, order);
> 	if (!orders)
> 		goto out;
>
> That will mask out any orders that are bigger than the one returned by
> count_trailing_zeros(), then next_order() will return the highest order in the
> remaining set.
>
> But even that doesn't quite work, because next_order() is destructive. Once you
> arrive on a higher order address boundary, you want to be able to select a
> higher order from the original orders bitmap. But you have lost them on a
> previous go around the loop.
>
> Perhaps stash orig_orders at the top of the function when you first calculate
> it. Then I think this works (totally untested):
>
> 	order = count_trailing_zeros(address >> PAGE_SHIFT);
> 	orders = orig_orders & (1UL << order + 1) - 1;
> 	order = next_order(&orders, order);
> 	if (!orders)
> 		goto out;
>
> You might want to do something like this for the first go around the loop, but I
> think address is currently always at the start of the PMD on entry, so it's not
> needed until that restriction is removed.

Will take a look; we just need order to be <= the order derived from the
trailing zeroes, and then we need the first enabled order below that in the
bitmask. Shouldn't be too complicated.

>
>> +		if (order < 2)
>> +			goto out;
>> +
>> +maybe_mmap_lock:
>> +		if (!(*mmap_locked)) {
>> +			mmap_read_lock(mm);
> Given the lock was already held in read mode on entering this function, then
> released by collapse_huge_page(), is it definitely safe to retake this lock and
> rerun this function? Is it possible that state that was checked before entering
> this function has changed since the lock was released and would now need
> re-checking?

Thanks, what I am missing is a hugepage_vma_revalidate().

>> +			*mmap_locked = true;
>> +		}
>> +		goto scan_pte_range;
>>  	}
>>  out:
>>  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
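
For the record, something along these lines is what I have in mind for the
re-lock path (untested sketch; the hugepage_vma_revalidate() signature is from
memory and would likely need to become order-aware, so treat it as an
assumption):

	maybe_mmap_lock:
		if (!(*mmap_locked)) {
			mmap_read_lock(mm);
			*mmap_locked = true;
			/* Re-check the VMA under the lock before rescanning */
			result = hugepage_vma_revalidate(mm, org_address, true,
							 &vma, cc);
			if (result != SCAN_SUCCEED)
				goto out;
		}
		goto scan_pte_range;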