From: Yang Shi
Date: Wed, 9 Mar 2022 15:43:53 -0800
Subject: Re: [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
To: "Zach O'Keefe"
Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
    Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
    Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
    Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
    Hugh Dickins, Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
    "Kirill A. Shutemov", Matthew Wilcox, Matt Turner, Max Filippov,
    Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
    Richard Henderson, Thomas Bogendoerfer
References: <20220308213417.1407042-1-zokeefe@google.com>
    <20220308213417.1407042-12-zokeefe@google.com>
In-Reply-To: <20220308213417.1407042-12-zokeefe@google.com>

On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe wrote:
>
> The idea of hugepage collapse in process context was previously
> introduced by David Rientjes to linux-mm[1].
>
> The idea is to introduce a new madvise mode, MADV_COLLAPSE, that allows
> users to request a synchronous collapse of memory.
>
> The benefits of this approach are:
>
> * cpu is charged to the process that wants to spend the cycles for the
>   THP
> * avoid unpredictable timing of khugepaged collapse
> * flexible separation of sync userspace and async khugepaged THP collapse
>   policies
>
> Immediate users of this new functionality include:
>
> * malloc implementations that manage memory in hugepage-sized chunks,
>   but sometimes subrelease memory back to the system in native-sized
>   chunks via MADV_DONTNEED; zapping the pmd.  Later, when the memory
>   is hot, the implementation could madvise(MADV_COLLAPSE) to re-back the
>   memory by THP to regain TLB performance.
> * immediately back executable text by hugepages.  Current support
>   provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
>   system.
>
> To keep patches digestible, introduce MADV_COLLAPSE in a few stages.
>
> Add plumbing to existing madvise infrastructure, as well as populate
> uapi header files, leaving the actual madvise(MADV_COLLAPSE) handler
> stubbed out.  Only privately-mapped anon memory is supported for now.
>
> [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
>
> Signed-off-by: Zach O'Keefe
> ---
>  include/linux/huge_mm.h                | 12 +++++++
>  include/uapi/asm-generic/mman-common.h |  2 ++
>  mm/khugepaged.c                        | 46 ++++++++++++++++++++++++++
>  mm/madvise.c                           |  5 +++
>  4 files changed, 65 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index fd905b0b2c71..407b63ab4185 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -226,6 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>
>  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
>  		     int advice);
> +int madvise_collapse(struct vm_area_struct *vma,
> +		     struct vm_area_struct **prev,
> +		     unsigned long start, unsigned long end);
>  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
>  			   unsigned long end, long adjust_next);
>  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -383,6 +386,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>  	BUG();
>  	return 0;
>  }
> +
> +static inline int madvise_collapse(struct vm_area_struct *vma,
> +				   struct vm_area_struct **prev,
> +				   unsigned long start, unsigned long end)
> +{
> +	BUG();
> +	return 0;
> +}
> +
>  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
>  					 unsigned long start,
>  					 unsigned long end,
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6c1aa92a92e4..6ce1f1ceb432 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -77,6 +77,8 @@
>
>  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
>
> +#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> +
>  /* compatibility flags */
>  #define MAP_FILE	0
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 12ae765c5c32..ca1e523086ed 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2519,3 +2519,49 @@ void khugepaged_min_free_kbytes_update(void)
>  	set_recommended_min_free_kbytes();
>  	mutex_unlock(&khugepaged_mutex);
>  }
> +
> +/*
> + * Returns 0 if successfully able to collapse range into THPs (or range already
> + * backed by THPs). Due to implementation detail, THPs collapsed here may be
> + * split again before this function returns.
> + */
> +static int _madvise_collapse(struct mm_struct *mm,
> +			     struct vm_area_struct *vma,
> +			     struct vm_area_struct **prev,
> +			     unsigned long start,
> +			     unsigned long end, gfp_t gfp,
> +			     struct collapse_control *cc)
> +{
> +	/* Implemented in later patch */

Just like with the earlier patches: whenever you introduce a new
function, it is better to keep it in the same patch as its users.
And typically we don't do the "implement in the later patch" thing;
it makes review harder.

> +	return -ENOSYS;
> +}
> +
> +int madvise_collapse(struct vm_area_struct *vma,
> +		     struct vm_area_struct **prev, unsigned long start,
> +		     unsigned long end)
> +{
> +	struct collapse_control cc;
> +	gfp_t gfp;
> +	int error;
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	/* Requested to hold mmap_lock in read */
> +	mmap_assert_locked(mm);
> +
> +	mmgrab(mm);
> +	collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
> +	gfp = vma_thp_gfp_mask(vma);
> +	lru_add_drain(); /* lru_add_drain_all() too heavy here */
> +	error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
> +	mmap_assert_locked(mm);
> +	mmdrop(mm);
> +
> +	/*
> +	 * madvise() returns EAGAIN if kernel resources are temporarily
> +	 * unavailable.
> +	 */
> +	if (error == -ENOMEM)
> +		error = -EAGAIN;
> +
> +	return error;
> +}
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5b6d796e55de..292aa017c150 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -58,6 +58,7 @@ static int madvise_need_mmap_write(int behavior)
>  	case MADV_FREE:
>  	case MADV_POPULATE_READ:
>  	case MADV_POPULATE_WRITE:
> +	case MADV_COLLAPSE:
>  		return 0;
>  	default:
>  		/* be safe, default to 1. list exceptions explicitly */
> @@ -1046,6 +1047,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
>  		if (error)
>  			goto out;
>  		break;
> +	case MADV_COLLAPSE:
> +		return madvise_collapse(vma, prev, start, end);
>  	}
>
>  	anon_name = anon_vma_name(vma);
> @@ -1139,6 +1142,7 @@ madvise_behavior_valid(int behavior)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	case MADV_HUGEPAGE:
>  	case MADV_NOHUGEPAGE:
> +	case MADV_COLLAPSE:
>  #endif
>  	case MADV_DONTDUMP:
>  	case MADV_DODUMP:
> @@ -1328,6 +1332,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
>   *		transparent huge pages so the existing pages will not be
>   *		coalesced into THP and new pages will not be allocated as THP.
> + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> --
> 2.35.1.616.g0bdcbb4464-goog
>
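
For anyone following the thread from the userspace side, here is a rough
sketch of how the proposed advice value might be exercised on
privately-mapped anon memory. It is only an illustration under
assumptions, not part of the patch: the value 25 is taken from the
quoted uapi hunk, the 2 MiB huge page size assumes x86-64 with 4 KiB
base pages, and hpage_align_up(), try_collapse() and demo_collapse() are
hypothetical helper names. With only this patch applied the stub would
fail with ENOSYS, and a kernel that does not know the advice value
rejects it with EINVAL, so the caller has to treat any failure as
"no THP for now".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Value proposed in the quoted patch; older uapi headers do not define it. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: PMD-sized THPs are 2 MiB (x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: round addr up to the next huge page boundary. */
static uintptr_t hpage_align_up(uintptr_t addr)
{
	return (addr + HPAGE_SIZE - 1) & ~((uintptr_t)HPAGE_SIZE - 1);
}

/*
 * Hypothetical helper: request a synchronous collapse of
 * [addr, addr + len). Returns 0 on success or a negative errno value.
 */
static int try_collapse(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return 0;
	return -errno;
}

/*
 * Map private anon memory, fault in one huge-page-aligned window with
 * base pages, then ask for it to be collapsed.
 */
static int demo_collapse(void)
{
	size_t map_len = 2 * HPAGE_SIZE;	/* room for one aligned window */
	char *base, *aligned;
	int ret;

	base = mmap(NULL, map_len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -errno;

	aligned = (char *)hpage_align_up((uintptr_t)base);
	assert((uintptr_t)aligned % HPAGE_SIZE == 0);

	memset(aligned, 1, HPAGE_SIZE);		/* touch every page */
	ret = try_collapse(aligned, HPAGE_SIZE);

	munmap(base, map_len);
	return ret;
}
```

A caller would typically treat -EAGAIN (which the quoted
madvise_collapse() substitutes for -ENOMEM) as a hint to retry later,
and any other error as a reason to leave the range to khugepaged.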