From: "Yin, Fengwei" <fengwei.yin@intel.com>
To: Ryan Roberts <ryan.roberts@arm.com>,
Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
David Hildenbrand <david@redhat.com>, Yu Zhao <yuzhao@google.com>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Yang Shi <shy828301@gmail.com>
Cc: <linux-arm-kernel@lists.infradead.org>,
<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH v2 4/5] mm: FLEXIBLE_THP for improved performance
Date: Tue, 4 Jul 2023 11:45:14 +0800
Message-ID: <6865a59e-9e40-282d-c434-b7c757388b65@intel.com>
In-Reply-To: <20230703135330.1865927-5-ryan.roberts@arm.com>
On 7/3/2023 9:53 PM, Ryan Roberts wrote:
> Introduce FLEXIBLE_THP feature, which allows anonymous memory to be
THP refers to huge pages, which are 2M in size. The folios here are not
huge pages. But I don't have a better name either.
> allocated in large folios of a specified order. All pages of the large
> folio are pte-mapped during the same page fault, significantly reducing
> the number of page faults. The number of per-page operations (e.g. ref
> counting, rmap management, lru list management) are also significantly
> reduced since those ops now become per-folio.
>
> The new behaviour is hidden behind the new FLEXIBLE_THP Kconfig, which
> defaults to disabled for now; there is a long list of todos to make
> FLEXIBLE_THP robust with existing features (e.g. compaction, mlock, some
> madvise ops, etc). These items will be tackled in subsequent patches.
>
> When enabled, the preferred folio order is as returned by
> arch_wants_pte_order(), which may be overridden by the arch as it sees
> fit. Some architectures (e.g. arm64) can coalesce TLB entries if a
> contiguous set of ptes map physically contiguous, naturally aligned
> memory, so this mechanism allows the architecture to optimize as
> required.
>
> If the preferred order can't be used (e.g. because the folio would
> breach the bounds of the vma, or because ptes in the region are already
> mapped) then we fall back to a suitable lower order.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
> mm/Kconfig | 10 ++++
> mm/memory.c | 168 ++++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 165 insertions(+), 13 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..1c06b2c0a24e 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -822,6 +822,16 @@ config READ_ONLY_THP_FOR_FS
> support of file THPs will be developed in the next few release
> cycles.
>
> +config FLEXIBLE_THP
> + bool "Flexible order THP"
> + depends on TRANSPARENT_HUGEPAGE
> + default n
> + help
> + Use large (bigger than order-0) folios to back anonymous memory where
> + possible, even if the order of the folio is smaller than the PMD
> + order. This reduces the number of page faults, as well as other
> + per-page overheads to improve performance for many workloads.
> +
> endif # TRANSPARENT_HUGEPAGE
>
> #
> diff --git a/mm/memory.c b/mm/memory.c
> index fb30f7523550..abe2ea94f3f5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3001,6 +3001,116 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
> return 0;
> }
>
> +#ifdef CONFIG_FLEXIBLE_THP
> +/*
> + * Allocates, zeros and returns a folio of the requested order for use as
> + * anonymous memory.
> + */
> +static struct folio *alloc_anon_folio(struct vm_area_struct *vma,
> + unsigned long addr, int order)
> +{
> + gfp_t gfp;
> + struct folio *folio;
> +
> + if (order == 0)
> + return vma_alloc_zeroed_movable_folio(vma, addr);
> +
> + gfp = vma_thp_gfp_mask(vma);
> + folio = vma_alloc_folio(gfp, order, vma, addr, true);
> + if (folio)
> + clear_huge_page(&folio->page, addr, folio_nr_pages(folio));
> +
> + return folio;
> +}
> +
> +/*
> + * Preferred folio order to allocate for anonymous memory.
> + */
> +#define max_anon_folio_order(vma) arch_wants_pte_order(vma)
> +#else
> +#define alloc_anon_folio(vma, addr, order) \
> + vma_alloc_zeroed_movable_folio(vma, addr)
> +#define max_anon_folio_order(vma) 0
> +#endif
> +
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> + int i;
> +
> + for (i = 0; i < nr; i++) {
> + if (!pte_none(ptep_get(pte++)))
> + return i;
> + }
> +
> + return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> + /*
> + * The aim here is to determine what size of folio we should allocate
> + * for this fault. Factors include:
> + * - Order must not be higher than `order` upon entry
> + * - Folio must be naturally aligned within VA space
> + * - Folio must be fully contained inside one pmd entry
> + * - Folio must not breach boundaries of vma
> + * - Folio must not overlap any non-none ptes
> + *
> + * Additionally, we do not allow order-1 since this breaks assumptions
> + * elsewhere in the mm; THP pages must be at least order-2 (since they
> + * store state up to the 3rd struct page subpage), and these pages must
> + * be THP in order to correctly use pre-existing THP infrastructure such
> + * as folio_split().
> + *
> + * Note that the caller may or may not choose to lock the pte. If
> + * unlocked, the result is racy and the user must re-check any overlap
> + * with non-none ptes under the lock.
> + */
> +
> + struct vm_area_struct *vma = vmf->vma;
> + int nr;
> + unsigned long addr;
> + pte_t *pte;
> + pte_t *first_set = NULL;
> + int ret;
> +
> + order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> + for (; order > 1; order--) {
> + nr = 1 << order;
> + addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> + pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> + /* Check vma bounds. */
> + if (addr < vma->vm_start ||
> + addr + (nr << PAGE_SHIFT) > vma->vm_end)
> + continue;
> +
> + /* Ptes covered by order already known to be none. */
> + if (pte + nr <= first_set)
> + break;
> +
> + /* Already found set pte in range covered by order. */
> + if (pte <= first_set)
> + continue;
> +
> + /* Need to check if all the ptes are none. */
> + ret = check_ptes_none(pte, nr);
> + if (ret == nr)
> + break;
> +
> + first_set = pte + ret;
> + }
> +
> + if (order == 1)
> + order = 0;
> +
> + return order;
> +}
The only logic in the above function that should be kept is the check of
whether the order fits within the vma range. check_ptes_none() is not accurate
here because the page table lock is not held and a concurrent fault could
happen. So maybe just drop the check here? check_ptes_none() is done again
after taking the page table lock anyway.
For now, we would just pick either the arch preferred order or order 0.
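To illustrate the suggestion above, here is a minimal userspace sketch (not
kernel code) of an order selection that only checks the vma bounds. The
struct vma_range type, the pick_anon_folio_order() name and the 4K PAGE_SHIFT
are assumptions made for the example. The order-1 exclusion follows the
comment in the quoted function, and the clamp to the PMD order is left out
for brevity.

```c
#define PAGE_SHIFT 12	/* assumption: 4K base pages */

/* Hypothetical stand-in for the vm_start/vm_end fields of a vma. */
struct vma_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Return the preferred order if a naturally aligned folio of that order
 * around fault_addr fits inside the vma, otherwise fall back to order 0.
 * No pte is inspected here; that check is left to the locked path.
 */
static int pick_anon_folio_order(const struct vma_range *vma,
				 unsigned long fault_addr, int preferred)
{
	unsigned long size, addr;

	/* Order-1 is not allowed, so anything below order-2 means order-0. */
	if (preferred < 2)
		return 0;

	size = 1UL << (preferred + PAGE_SHIFT);
	addr = fault_addr & ~(size - 1);	/* naturally aligned start */

	/* The folio must not breach the boundaries of the vma. */
	if (addr < vma->start || addr + size > vma->end)
		return 0;

	return preferred;
}
```

With a range of 0x10000..0x30000 and a preferred order of 4 (64K), a fault at
0x12345 keeps order 4, while the same fault in a range starting at 0x11000
drops to order 0 because the aligned start falls below the range.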
> +
> /*
> * Handle write page faults for pages that can be reused in the current vma
> *
> @@ -3073,7 +3183,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> goto oom;
>
> if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
> - new_folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> + new_folio = alloc_anon_folio(vma, vmf->address, 0);
> if (!new_folio)
> goto oom;
> } else {
> @@ -4040,6 +4150,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> struct folio *folio;
> vm_fault_t ret = 0;
> pte_t entry;
> + int order;
> + int pgcount;
> + unsigned long addr;
>
> /* File mapping without ->vm_ops ? */
> if (vma->vm_flags & VM_SHARED)
> @@ -4081,24 +4194,51 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return handle_userfault(vmf, VM_UFFD_MISSING);
> }
> - goto setpte;
> + if (uffd_wp)
> + entry = pte_mkuffd_wp(entry);
> + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> + /* No need to invalidate - it was non-present before */
> + update_mmu_cache(vma, vmf->address, vmf->pte);
> + goto unlock;
> + }
> +
> + /*
> + * If allocating a large folio, determine the biggest suitable order for
> + * the VMA (e.g. it must not exceed the VMA's bounds, it must not
> + * overlap with any populated PTEs, etc). We are not under the ptl here
> + * so we will need to re-check that we are not overlapping any populated
> + * PTEs once we have the lock.
> + */
> + order = uffd_wp ? 0 : max_anon_folio_order(vma);
> + if (order > 0) {
> + vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> + order = calc_anon_folio_order_alloc(vmf, order);
> + pte_unmap(vmf->pte);
> }
>
> - /* Allocate our own private page. */
> + /* Allocate our own private folio. */
> if (unlikely(anon_vma_prepare(vma)))
> goto oom;
> - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> + folio = alloc_anon_folio(vma, vmf->address, order);
> + if (!folio && order > 0) {
> + order = 0;
> + folio = alloc_anon_folio(vma, vmf->address, order);
> + }
> if (!folio)
> goto oom;
>
> + pgcount = 1 << order;
> + addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> +
> if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> goto oom_free_page;
> folio_throttle_swaprate(folio, GFP_KERNEL);
>
> /*
> * The memory barrier inside __folio_mark_uptodate makes sure that
> - * preceding stores to the page contents become visible before
> - * the set_pte_at() write.
> + * preceding stores to the folio contents become visible before
> + * the set_ptes() write.
> */
> __folio_mark_uptodate(folio);
>
> @@ -4107,11 +4247,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> if (vma->vm_flags & VM_WRITE)
> entry = pte_mkwrite(pte_mkdirty(entry));
>
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> - &vmf->ptl);
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> if (vmf_pte_changed(vmf)) {
> update_mmu_tlb(vma, vmf->address, vmf->pte);
> goto release;
> + } else if (order > 0 && check_ptes_none(vmf->pte, pgcount) != pgcount) {
This could be the case where we allocated an order-4 folio and then find that
a neighboring PTE has been filled by a concurrent fault. Should we put the
current folio, fall back to order 0 and try again immediately (i.e. goto the
order-0 allocation instead of returning from this function, which would go
through part of the page fault path again)?
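The control flow being asked about can be pictured with a small userspace
model (not kernel code). The page table is modelled as an array of bool
slots, and the names first_populated() and map_with_fallback() are
hypothetical. Allocation, folio_put() and the page table lock are only
indicated by comments.

```c
#include <stdbool.h>

#define NR_PTES 16	/* assumption: size of the modelled page table */

/* Index of the first populated slot in [start, start + nr), or nr if none. */
static int first_populated(const bool *ptes, int start, int nr)
{
	for (int i = 0; i < nr; i++) {
		if (ptes[start + i])
			return i;
	}
	return nr;
}

/*
 * Try to map a folio of the given order around fault_idx. If any slot in
 * the range is already populated, drop to order 0 and retry immediately
 * rather than returning to the caller. Returns the order that was mapped,
 * or -1 if the faulting slot itself is already populated.
 */
static int map_with_fallback(bool *ptes, int fault_idx, int order)
{
	for (;;) {
		int nr = 1 << order;
		int start = fault_idx & ~(nr - 1);

		/* Kernel: allocate the folio, then take the ptl. */
		if (first_populated(ptes, start, nr) == nr) {
			for (int i = 0; i < nr; i++)
				ptes[start + i] = true;
			return order;
		}

		/* Kernel: unlock and put the folio that lost the race. */
		if (order == 0)
			return -1;
		order = 0;
	}
}
```

In this model, a fault at slot 5 with order 2 maps slots 4..7 when they are
all empty. If slot 6 is already populated, only slot 5 is mapped at order 0.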
Regards
Yin, Fengwei
> + goto release;
> }
>
> ret = check_stable_address_space(vma->vm_mm);
> @@ -4125,16 +4266,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> return handle_userfault(vmf, VM_UFFD_MISSING);
> }
>
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - folio_add_new_anon_rmap(folio, vma, vmf->address);
> + folio_ref_add(folio, pgcount - 1);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> + folio_add_new_anon_rmap(folio, vma, addr);
> folio_add_lru_vma(folio, vma);
> -setpte:
> +
> if (uffd_wp)
> entry = pte_mkuffd_wp(entry);
> - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>
> /* No need to invalidate - it was non-present before */
> - update_mmu_cache(vma, vmf->address, vmf->pte);
> + update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
> unlock:
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> return ret;