From: "Zach O'Keefe"
Date: Wed, 9 Mar 2022 17:11:32 -0800
Subject: Re: [RFC PATCH 11/14] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
To: Yang Shi
Cc: Alex Shi, David Hildenbrand, David Rientjes, Michal Hocko,
	Pasha Tatashin, SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan,
	Linux MM, Andrea Arcangeli, Andrew Morton, Arnd Bergmann,
	Axel Rasmussen, Chris Kennelly, Chris Zankel, Helge Deller,
	Hugh Dickins, Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
	"Kirill A. Shutemov", Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Richard Henderson, Thomas Bogendoerfer
References: <20220308213417.1407042-1-zokeefe@google.com>
	<20220308213417.1407042-12-zokeefe@google.com>

Hey Yang.

Ack this, as well as similar feedback from earlier in the series. Really
was just trying to avoid a giant monolithic patch that would also make
review harder. I can rework though.

Thank you

On Wed, Mar 9, 2022 at 3:44 PM Yang Shi wrote:
>
> On Tue, Mar 8, 2022 at 1:35 PM Zach O'Keefe wrote:
> >
> > The idea of hugepage collapse in process context was previously
> > introduced by David Rientjes to linux-mm[1].
> >
> > The idea is to introduce a new madvise mode, MADV_COLLAPSE, that allows
> > users to request a synchronous collapse of memory.
> >
> > The benefits of this approach are:
> >
> > * cpu is charged to the process that wants to spend the cycles for the
> >   THP
> > * avoid unpredictable timing of khugepaged collapse
> > * flexible separation of sync userspace and async khugepaged THP collapse
> >   policies
> >
> > Immediate users of this new functionality include:
> >
> > * malloc implementations that manage memory in hugepage-sized chunks,
> >   but sometimes subrelease memory back to the system in native-sized
> >   chunks via MADV_DONTNEED; zapping the pmd.
> >   Later, when the memory is hot, the implementation could
> >   madvise(MADV_COLLAPSE) to re-back the memory by THP to regain TLB
> >   performance.
> > * immediately back executable text by hugepages.  Current support
> >   provided by CONFIG_READ_ONLY_THP_FOR_FS may take too long on a large
> >   system.
> >
> > To keep patches digestible, introduce MADV_COLLAPSE in a few stages.
> >
> > Add plumbing to existing madvise infrastructure, as well as populate
> > uapi header files, leaving the actual madvise(MADV_COLLAPSE) handler
> > stubbed out.  Only privately-mapped anon memory is supported for now.
> >
> > [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> >
> > Signed-off-by: Zach O'Keefe
> > ---
> >  include/linux/huge_mm.h                | 12 +++++++
> >  include/uapi/asm-generic/mman-common.h |  2 ++
> >  mm/khugepaged.c                        | 46 ++++++++++++++++++++++++++
> >  mm/madvise.c                           |  5 +++
> >  4 files changed, 65 insertions(+)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index fd905b0b2c71..407b63ab4185 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -226,6 +226,9 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >
> >  int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> >  		     int advice);
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +		     struct vm_area_struct **prev,
> > +		     unsigned long start, unsigned long end);
> >  void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> >  			   unsigned long end, long adjust_next);
> >  spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> > @@ -383,6 +386,15 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
> >  	BUG();
> >  	return 0;
> >  }
> > +
> > +static inline int madvise_collapse(struct vm_area_struct *vma,
> > +				   struct vm_area_struct **prev,
> > +				   unsigned long start, unsigned long end)
> > +{
> > +	BUG();
> > +	return 0;
> > +}
> > +
> >  static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> >  					 unsigned long start,
> >  					 unsigned long end,
> > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> > index 6c1aa92a92e4..6ce1f1ceb432 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -77,6 +77,8 @@
> >
> >  #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
> >
> > +#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
> > +
> >  /* compatibility flags */
> >  #define MAP_FILE	0
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 12ae765c5c32..ca1e523086ed 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2519,3 +2519,49 @@ void khugepaged_min_free_kbytes_update(void)
> >  		set_recommended_min_free_kbytes();
> >  	mutex_unlock(&khugepaged_mutex);
> >  }
> > +
> > +/*
> > + * Returns 0 if successfully able to collapse range into THPs (or range already
> > + * backed by THPs). Due to implementation detail, THPs collapsed here may be
> > + * split again before this function returns.
> > + */
> > +static int _madvise_collapse(struct mm_struct *mm,
> > +			     struct vm_area_struct *vma,
> > +			     struct vm_area_struct **prev,
> > +			     unsigned long start,
> > +			     unsigned long end, gfp_t gfp,
> > +			     struct collapse_control *cc)
> > +{
> > +	/* Implemented in later patch */
>
> Just like the earlier patches, as long as you introduce a new
> function, it is better to keep it with its users in the same patch.
> And typically we don't do the "implement in the later patch" thing, it
> makes review harder.
>
> > +	return -ENOSYS;
> > +}
> > +
> > +int madvise_collapse(struct vm_area_struct *vma,
> > +		     struct vm_area_struct **prev, unsigned long start,
> > +		     unsigned long end)
> > +{
> > +	struct collapse_control cc;
> > +	gfp_t gfp;
> > +	int error;
> > +	struct mm_struct *mm = vma->vm_mm;
> > +
> > +	/* Requested to hold mmap_lock in read */
> > +	mmap_assert_locked(mm);
> > +
> > +	mmgrab(mm);
> > +	collapse_control_init(&cc, /* enforce_pte_scan_limits= */ false);
> > +	gfp = vma_thp_gfp_mask(vma);
> > +	lru_add_drain(); /* lru_add_drain_all() too heavy here */
> > +	error = _madvise_collapse(mm, vma, prev, start, end, gfp, &cc);
> > +	mmap_assert_locked(mm);
> > +	mmdrop(mm);
> > +
> > +	/*
> > +	 * madvise() returns EAGAIN if kernel resources are temporarily
> > +	 * unavailable.
> > +	 */
> > +	if (error == -ENOMEM)
> > +		error = -EAGAIN;
> > +
> > +	return error;
> > +}
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 5b6d796e55de..292aa017c150 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -58,6 +58,7 @@ static int madvise_need_mmap_write(int behavior)
> >  	case MADV_FREE:
> >  	case MADV_POPULATE_READ:
> >  	case MADV_POPULATE_WRITE:
> > +	case MADV_COLLAPSE:
> >  		return 0;
> >  	default:
> >  		/* be safe, default to 1. list exceptions explicitly */
> > @@ -1046,6 +1047,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> >  		if (error)
> >  			goto out;
> >  		break;
> > +	case MADV_COLLAPSE:
> > +		return madvise_collapse(vma, prev, start, end);
> >  	}
> >
> >  	anon_name = anon_vma_name(vma);
> > @@ -1139,6 +1142,7 @@ madvise_behavior_valid(int behavior)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  	case MADV_HUGEPAGE:
> >  	case MADV_NOHUGEPAGE:
> > +	case MADV_COLLAPSE:
> >  #endif
> >  	case MADV_DONTDUMP:
> >  	case MADV_DODUMP:
> > @@ -1328,6 +1332,7 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> >   *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> >   *		transparent huge pages so the existing pages will not be
> >   *		coalesced into THP and new pages will not be allocated as THP.
> > + *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
> >   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
> >   *		from being included in its core dump.
> >   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > --
> > 2.35.1.616.g0bdcbb4464-goog
> >
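
For anyone following the thread from the userspace side, here is a rough
sketch of how a caller might issue the new advice and interpret the errors
discussed in the quoted patch. It is illustrative only and not part of the
series. The names try_collapse(), classify_collapse_errno() and the
collapse_result enum are hypothetical. The value 25 for MADV_COLLAPSE comes
from the uapi hunk above, EAGAIN reflects the -ENOMEM to -EAGAIN mapping in
madvise_collapse(), and ENOSYS is what the stubbed handler returns at this
stage of the series. Treating EINVAL as "unsupported" is an assumption about
kernels that do not know the advice value; EINVAL can have other causes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Value taken from the uapi hunk in the quoted patch; libc headers may lack it. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Hypothetical result type, used only for this illustration. */
enum collapse_result {
	COLLAPSE_OK,          /* range is (or already was) backed by THPs */
	COLLAPSE_RETRY,       /* temporary resource shortage, may try again */
	COLLAPSE_UNSUPPORTED, /* kernel cannot service the request at all */
	COLLAPSE_FAILED       /* any other error */
};

/* Map an errno value from madvise(MADV_COLLAPSE) to a coarse outcome. */
static enum collapse_result classify_collapse_errno(int err)
{
	switch (err) {
	case 0:
		return COLLAPSE_OK;
	case EAGAIN: /* the patch converts -ENOMEM into -EAGAIN */
		return COLLAPSE_RETRY;
	case ENOSYS: /* stubbed _madvise_collapse() in this patch */
	case EINVAL: /* assumption: advice value unknown to this kernel */
		return COLLAPSE_UNSUPPORTED;
	default:
		return COLLAPSE_FAILED;
	}
}

/* Request a synchronous collapse of [addr, addr + len). */
static enum collapse_result try_collapse(void *addr, size_t len)
{
	assert(addr != NULL);

	if (madvise(addr, len, MADV_COLLAPSE) == 0)
		return COLLAPSE_OK;

	return classify_collapse_errno(errno);
}
```

A malloc implementation like the one described in the commit message could
call try_collapse() on a hugepage-sized chunk once it becomes hot again, and
leave the range to khugepaged on COLLAPSE_RETRY or COLLAPSE_UNSUPPORTED.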