Date: Fri, 3 Jun 2022 17:39:49 -0700
Message-Id: <20220604004004.954674-1-zokeefe@google.com>
Mime-Version: 1.0
X-Mailer: git-send-email 2.36.1.255.ge46751e96f-goog
Subject: [PATCH v6 00/15] mm: userspace hugepage collapse
From: "Zach O'Keefe"
To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox,
    Michal Hocko, Pasha Tatashin, Peter Xu, Rongwei Wang, SeongJae Park,
    Song Liu, Vlastimil Babka, Yang Shi, Zi Yan, linux-mm@kvack.org
Cc: Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
    Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
    Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
    "Kirill A. Shutemov", Matt Turner, Max Filippov, Miaohe Lin,
    Minchan Kim, Patrick Xia, Pavel Begunkov, Thomas Bogendoerfer,
    "Zach O'Keefe"
Content-Type: text/plain; charset="UTF-8"

v6 Foreword
--------------------------------

v6 improves on v5[1] in 3 major ways:

1.  Changed MADV_COLLAPSE eligibility semantics.  In v5, MADV_COLLAPSE
    ignored khugepaged max_ptes_* sysfs settings, as well as all sysfs
    defrag settings.  v6 takes this further by also decoupling
    MADV_COLLAPSE from the sysfs enabled setting.  MADV_COLLAPSE can now
    initiate a collapse of memory into THPs in "madvise" and "never" mode,
    and never requires VM_HUGEPAGE.  MADV_COLLAPSE still refuses to
    operate on VM_NOHUGEPAGE-marked VMAs.

2.  Thanks to a patch by Yang Shi to remove UMA hugepage preallocation,
    hugepage allocation in khugepaged is independent of CONFIG_NUMA.  This
    allows us to reuse all the allocation codepaths between collapse
    contexts, greatly simplifying struct collapse_control.  Redundant
    khugepaged heuristic flags have also been merged into a new
    enforce_page_heuristics flag.

3.  Using MADV_COLLAPSE's new eligibility semantics, the hacks in the
    selftests to disable khugepaged are no longer necessary, since we can
    test MADV_COLLAPSE in "never" THP mode to prevent khugepaged
    interaction.

Introduction
--------------------------------

This series provides a mechanism for userspace to induce a collapse of
eligible ranges of memory into transparent hugepages in process context,
thus permitting users to more tightly control their own hugepage
utilization policy at their own expense.

This idea was introduced by David Rientjes[2].

Interface
--------------------------------

The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
leverages the new process_madvise(2) call.

process_madvise(2)

    Performs a synchronous collapse of the native pages mapped by the
    list of iovecs into transparent hugepages.

    This operation is independent of the system THP sysfs settings, but
    attempts to collapse VMAs marked VM_NOHUGEPAGE will still fail.

    THP allocation may enter direct reclaim and/or compaction.

    When a range spans multiple VMAs, the collapse semantics over each
    VMA are independent of the others.

    Caller must have CAP_SYS_ADMIN if not acting on self.

    Return value follows existing process_madvise(2) conventions.  A
    “success” indicates that all hugepage-sized/aligned regions covered
    by the provided range were either successfully collapsed, or were
    already pmd-mapped THPs.

madvise(2)

    Equivalent to process_madvise(2) on self, with 0 returned on
    “success”.

Current Use-Cases
--------------------------------

(1) Immediately back executable text by THPs.  Current support provided
    by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
    system, which might prevent services from serving at their full
    rated load after (re)starting.
    Tricks like mremap(2)'ing text onto anonymous memory to immediately
    realize iTLB performance prevent page sharing and demand paging, and
    losing either increases the steady-state memory footprint.  With
    MADV_COLLAPSE, we get the best of both worlds: peak upfront
    performance and lower RAM footprints.  Note that subsequent support
    for file-backed memory is required here.

(2) malloc() implementations that manage memory in hugepage-sized
    chunks, but sometimes subrelease memory back to the system in
    native-sized chunks via MADV_DONTNEED, zapping the pmd.  Later, when
    the memory is hot, the implementation could madvise(MADV_COLLAPSE)
    to re-back the memory by THPs to regain hugepage coverage and dTLB
    performance.  TCMalloc is such an implementation that could benefit
    from this[3].  A prior study of Google internal workloads during
    evaluation of Temeraire, a hugepage-aware enhancement to TCMalloc,
    showed that nearly 20% of all cpu cycles were spent in dTLB stalls,
    and that increasing hugepage coverage by even a small amount can
    help with that[4].

Future work
--------------------------------

Only private anonymous memory is supported by this series.  File and
shmem memory support will be added later.

One possible user of this functionality is a userspace agent that
attempts to optimize THP utilization system-wide by allocating THPs
based on, for example, task priority, task performance requirements, or
heatmaps.  For the latter, one idea that has already surfaced is using
DAMON to identify hot regions, and driving THP collapse through a new
DAMOS_COLLAPSE scheme[5].

Sequence of Patches
--------------------------------

* Patch 1 (Yang Shi) removes UMA hugepage preallocation and makes
  khugepaged hugepage allocation independent of CONFIG_NUMA.
* Patches 2-8 perform refactoring of collapse logic within khugepaged.c
  and introduce the notion of a collapse context.
* Patch 9 introduces MADV_COLLAPSE and is the main patch in this series.
* Patch 10 is a tidy-up.
* Patch 11 adds process_madvise(2) support.
* Patches 12-14 add selftests.
* Patch 15 adds support for user tools.

Applies against next-20220603

Changelog
--------------------------------

v5 -> v6:
* Added 'mm: khugepaged: don't carry huge page to the next loop for
  !CONFIG_NUMA' (Yang Shi)
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Add a pmd_bad() check for nonhuge pmds (Peter Xu)
* 'mm/khugepaged: dedup and simplify hugepage alloc and charging'
  -> Remove dependency on 'mm/khugepaged: sched to numa node when collapse
     huge page'
  -> No more !NUMA casing
* 'mm/khugepaged: make allocation semantics context-specific'
  -> Renamed from 'mm/khugepaged: make hugepage allocation
     context-specific'
  -> Removed function pointer hooks. (David Rientjes)
  -> Added gfp_t member to control allocation semantics.
* 'mm/khugepaged: add flag to ignore khugepaged heuristics'
  -> Squashed from 'mm/khugepaged: add flag to ignore
     khugepaged_max_ptes_*' and 'mm/khugepaged: add flag to ignore page
     young/referenced requirement'. (David Rientjes)
* Added 'mm/khugepaged: add flag to ignore THP sysfs enabled'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Use hugepage_vma_check() instead of transparent_hugepage_active()
     to determine vma eligibility.
  -> Only retry collapse once per hugepage if pages aren't found on LRU
  -> Save last failed result for more accurate errno
  -> Refactored loop structure
  -> Renamed labels
* 'selftests/vm: modularize collapse selftests'
  -> Refactored into straightline code and removed loop over contexts.
* 'selftests/vm: add MADV_COLLAPSE collapse context to selftests'
  -> Removed ->init() and ->cleanup() hooks from struct
     collapse_context() (David Rientjes)
  -> MADV_COLLAPSE operates in "never" THP mode to prevent khugepaged
     interaction.  Removed all the previous khugepaged hacks.
* Added 'tools headers uapi: add MADV_COLLAPSE madvise mode to tools'
* Rebased on next-20220603

v4 -> v5:
* Fix kernel test robot errors
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fix khugepaged_alloc_page() UMA definition
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Add "fallthrough" pseudo keyword to fix -Wimplicit-fallthrough

v3 -> v4:
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Dropped pmd_none() check from find_pmd_or_thp_or_none()
  -> Moved SCAN_PMD_MAPPED after SCAN_PMD_NULL
  -> Dropped from sign-offs
* 'mm/khugepaged: add struct collapse_control'
  -> Updated commit description and some code comments
  -> Removed extra brackets added in khugepaged_find_target_node()
* Added 'mm/khugepaged: dedup hugepage allocation and charging code'
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Has been majorly reworked to replace ->gfp() and ->alloc_hpage()
     struct collapse_control hooks with a ->alloc_charge_hpage() hook
     which makes node-allocation, gfp flags, node scheduling, hpage
     allocation, and accounting/charging context-specific.
  -> Dropped from sign-offs
* Added 'mm/khugepaged: pipe enum scan_result codes back to callers'
  -> Replaces 'mm/khugepaged: add struct collapse_result'
* Dropped 'mm/khugepaged: add struct collapse_result'
* 'mm/khugepaged: add flag to ignore khugepaged_max_ptes_*'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/khugepaged: add flag to ignore page young/referenced requirement'
  -> Moved before 'mm/madvise: introduce MADV_COLLAPSE sync hugepage
     collapse'
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Moved struct collapse_control* argument to end of alloc_hpage()
  -> Some refactoring to rebase on top changes to struct
     collapse_control hook changes and other previous commits.
  -> Reworded commit description
  -> Dropped from sign-offs
* 'mm/khugepaged: rename prefix of shared collapse functions'
  -> Renamed from 'mm/khugepaged: remove khugepaged prefix from shared
     collapse functions'
  -> Instead of dropping "khugepaged_" prefix, replace with
     "hpage_collapse_"
  -> Dropped from sign-offs
* Rebased onto next-20220502

v2 -> v3:
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now are independent of khugepaged.
* Cover-letter: add primary use-cases and update description of collapse
  semantics.
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Added .gfp operation to struct collapse_control
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added madvise context .gfp implementation.
  -> Set scan_result appropriately on early exit due to mm exit or vma
     revalidation.
  -> Reword patch description
* Rebased onto next-20220426

v1 -> v2:
* Cover-letter clarification and added RFC -> v1 notes
* Fixes issues reported by kernel test robot
* 'mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP'
  -> Fixed mixed code/declarations
* 'mm/khugepaged: make hugepage allocation context-specific'
  -> Fixed bad function signature in !NUMA && TRANSPARENT_HUGEPAGE configs
  -> Added doc comment to retract_page_tables() for "cc"
* 'mm/khugepaged: add struct collapse_result'
  -> Added doc comment to retract_page_tables() for "cr"
* 'mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse'
  -> Added MADV_COLLAPSE definitions for alpha, mips, parisc, xtensa
  -> Moved an "#ifdef NUMA" so that khugepaged_find_target_node() is
     defined in !NUMA && TRANSPARENT_HUGEPAGE configs.
* 'mm/khugepaged: remove khugepaged prefix from shared collapse
  functions'
  -> Removed khugepaged prefix from khugepaged_find_target_node on L914
* Rebased onto next-20220414

RFC -> v1:
* The series was significantly reworked from RFC and most patches are
  entirely new or reworked.
* Collapse eligibility criteria have changed: MADV_COLLAPSE now respects
  VM_NOHUGEPAGE.
* Collapse semantics have changed: the gfp flags used for hugepage
  allocation now match those of khugepaged for the same VMA, instead of
  the gfp flags used at-fault by the calling process for the VMA.
* Collapse semantics have changed: the collapse semantics for multiple
  VMAs spanned by a single MADV_COLLAPSE call are now independent, whereas
  before the idea was to allow direct reclaim/compaction if any spanned
  VMA permitted so.
* The process_madvise(2) flags, MADV_F_COLLAPSE_LIMITS and
  MADV_F_COLLAPSE_DEFRAG, have been removed.
* Implementation change: the RFC implemented collapse over a range of
  hugepages in a batched fashion with the aim of doing multiple page
  table updates inside a single mmap_lock write.  This has been changed,
  and the implementation now collapses each hugepage-aligned/sized region
  iteratively.  This was motivated by an experiment which showed that,
  when multiple threads were concurrently faulting during a MADV_COLLAPSE
  operation, mean and tail latency to acquire mmap_lock in read for
  threads in the fault path was improved by using a batch size of 1
  (batch sizes of 1, 8, 16, 32 were tested)[6].
* Added: If a collapse operation fails because a page isn't found on the
  LRU, do a lru_add_drain_all() and retry.
* Added: selftests

[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[2] https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[3] https://github.com/google/tcmalloc/tree/master/tcmalloc
[4] https://research.google/pubs/pub50370/
[5] https://lore.kernel.org/lkml/bcc8d9a0-81d-5f34-5e4-fcc28eb7ce@google.com/T/
[6] https://lore.kernel.org/linux-mm/CAAa6QmRc76n-dspGT7UK8DkaqZAOz-CkCsME1V7KGtQ6Yt2FqA@mail.gmail.com/

Zach O'Keefe (15):
  mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
  mm/khugepaged: record SCAN_PMD_MAPPED when scan_pmd() finds THP
  mm/khugepaged: add struct collapse_control
  mm/khugepaged: dedup and simplify hugepage alloc and charging
  mm/khugepaged: make allocation semantics context-specific
  mm/khugepaged: pipe enum scan_result codes back to callers
  mm/khugepaged: add flag to ignore khugepaged heuristics
  mm/khugepaged: add flag to ignore THP sysfs enabled
  mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
  mm/khugepaged: rename prefix of shared collapse functions
  mm/madvise: add MADV_COLLAPSE to process_madvise()
  selftests/vm: modularize collapse selftests
  selftests/vm: add MADV_COLLAPSE collapse context to selftests
  selftests/vm: add selftest to verify recollapse of THPs
  tools headers uapi: add MADV_COLLAPSE madvise mode to tools

 arch/alpha/include/uapi/asm/mman.h           |   2 +
 arch/mips/include/uapi/asm/mman.h            |   2 +
 arch/parisc/include/uapi/asm/mman.h          |   2 +
 arch/xtensa/include/uapi/asm/mman.h          |   2 +
 include/linux/huge_mm.h                      |  12 +
 include/trace/events/huge_memory.h           |   3 +-
 include/uapi/asm-generic/mman-common.h       |   2 +
 mm/internal.h                                |   1 +
 mm/khugepaged.c                              | 673 +++++++++++--------
 mm/madvise.c                                 |  11 +-
 mm/rmap.c                                    |  15 +-
 tools/include/uapi/asm-generic/mman-common.h |   2 +
 tools/testing/selftests/vm/khugepaged.c      | 401 ++++++-----
 13 files changed, 679 insertions(+), 449 deletions(-)

-- 
2.36.1.255.ge46751e96f-goog
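
Illustrative usage (not part of the series): for readers who want to try
the madvise(2) form described under "Interface", the following is a
minimal userspace sketch, not code from any patch here.  It assumes a
kernel carrying this series, 2 MiB pmd-sized THPs (x86-64 with 4 KiB base
pages), and that MADV_COLLAPSE has the value 25 as in asm-generic
mman-common.h of mainline kernels that include it; the headers added by
the series are authoritative, and alpha, mips, parisc and xtensa carry
their own definitions.  The helper names align_up(), collapse_range() and
demo_collapse() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: asm-generic value; other architectures may differ. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB pmd-sized hugepages. */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~((uintptr_t)align - 1);
}

/* Request a synchronous collapse; 0 on success, -errno on failure. */
static int collapse_range(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}

/*
 * Map private anonymous memory, fault in nr_hpages hugepage-aligned
 * regions, then ask the kernel to collapse them.  Kernels without
 * MADV_COLLAPSE are expected to fail the madvise() call (e.g. -EINVAL).
 */
static int demo_collapse(size_t nr_hpages)
{
	size_t len = nr_hpages * HPAGE_SIZE;
	size_t map_len = len + HPAGE_SIZE;	/* slack for alignment */
	char *raw, *aligned;
	int ret;

	raw = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -errno;

	aligned = (char *)align_up((uintptr_t)raw, HPAGE_SIZE);
	memset(aligned, 0xab, len);		/* populate the pages */

	ret = collapse_range(aligned, len);	/* saved before munmap() */
	munmap(raw, map_len);
	return ret;
}
```

A caller would invoke demo_collapse(1) and treat 0 as "the region is now
backed by a THP (or already was)"; any negative value is the errno
reported by mmap() or madvise().  Over-mapping by one hugepage and
rounding up is just one simple way to obtain a hugepage-aligned range.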