From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDD06C433F5 for ; Mon, 16 May 2022 03:56:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 137DE6B0071; Sun, 15 May 2022 23:56:47 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0C2416B0072; Sun, 15 May 2022 23:56:47 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EA40F6B0073; Sun, 15 May 2022 23:56:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id D43166B0071 for ; Sun, 15 May 2022 23:56:46 -0400 (EDT) Received: from smtpin03.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id A601260FF9 for ; Mon, 16 May 2022 03:56:46 +0000 (UTC) X-FDA: 79470244812.03.5936E56 Received: from out30-133.freemail.mail.aliyun.com (out30-133.freemail.mail.aliyun.com [115.124.30.133]) by imf12.hostedemail.com (Postfix) with ESMTP id B8E52400B0 for ; Mon, 16 May 2022 03:56:17 +0000 (UTC) X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R201e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e04357;MF=rongwei.wang@linux.alibaba.com;NM=1;PH=DS;RN=34;SR=0;TI=SMTPD_---0VDD0fSK_1652673394; Received: from 30.212.152.187(mailfrom:rongwei.wang@linux.alibaba.com fp:SMTPD_---0VDD0fSK_1652673394) by smtp.aliyun-inc.com(127.0.0.1); Mon, 16 May 2022 11:56:38 +0800 Message-ID: Date: Mon, 16 May 2022 11:56:33 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Thunderbird/101.0 Subject: Re: [PATCH v5 10/13] mm/madvise: add MADV_COLLAPSE to process_madvise() Content-Language: en-US To: David Rientjes , Zach O'Keefe Cc: Alex Shi , David Hildenbrand , Matthew Wilcox , Michal Hocko , Pasha Tatashin , Peter Xu , SeongJae Park , Song Liu , Vlastimil Babka , Yang Shi , Zi Yan , linux-mm@kvack.org, Andrea Arcangeli , Andrew Morton , Arnd Bergmann , Axel Rasmussen , Chris Kennelly , Chris Zankel , Helge Deller , Hugh Dickins , Ivan Kokshaysky , "James E.J. Bottomley" , Jens Axboe , "Kirill A. Shutemov" , Matt Turner , Max Filippov , Miaohe Lin , Minchan Kim , Patrick Xia , Pavel Begunkov , Thomas Bogendoerfer , calling@linux.alibaba.com References: <20220504214437.2850685-1-zokeefe@google.com> <20220504214437.2850685-11-zokeefe@google.com> <502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com> <3a53435-5d5a-d6cb-5739-5c444523fc7@google.com> From: Rongwei Wang In-Reply-To: <3a53435-5d5a-d6cb-5739-5c444523fc7@google.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Stat-Signature: 9ppqjd6x47mk9qa5n5wqzp7nfc33pk9i X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: B8E52400B0 Authentication-Results: imf12.hostedemail.com; dkim=none; spf=pass (imf12.hostedemail.com: domain of rongwei.wang@linux.alibaba.com designates 115.124.30.133 as permitted sender) smtp.mailfrom=rongwei.wang@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com X-Rspam-User: X-HE-Tag: 1652673377-512114 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 5/13/22 4:03 AM, David Rientjes wrote: > On Wed, 11 May 2022, Zach O'Keefe wrote: > >> Hey Rongwei, >> >> Thanks for taking the time to review! >> >> On Tue, May 10, 2022 at 5:49 PM Rongwei Wang >> wrote: >>> >>> Hi, Zach >>> >>> Thanks for your great patchset! >>> Recently, We also try to collapse THP in this way, likes performance >>> degradation due to using too much hugepages in our scenes. >>> > > Rongwei, could you elaborate on this? I can understand undesired overhead > for allocation of a hugepage at the time of fault if there is not enough > benefit derived by the hugepages over the long term (like for a database > workload), is this the performance degradation you're referring to? > Hi David Sorry for the late reply. Actually, I'm not sure that the degradation is caused by using too much hugepages. Plus, I also referrd to Google's doc[1], and the TLB pressure mentioned here. Maybe the process_madvise(MADV_COLLAPSE) can help us to solve these issues. [1] https://dyninst.github.io/scalable_tools_workshop/petascale2018/assets/slides/CSCADS%202018%20perf_events%20status%20update.pdf -wrw > Otherwise I'm unfamiliar with performance degradation for using too much > hugepages after they have been allocated :) Maybe RSS grows too much and > we run into memory pressure? > > It would be good to know more if you can share details here. > >>> And there is a doubt about process_madvise(MADV_COLLAPSE) when we test >>> this patchset:. It seems that process_madvise(MADV_COLLAPSE) rely on >>> madvise(MADV_HUGEPAGE)? If the vma wasn't marked with 'hg', >>> process_madvise(MADV_COLLAPSE) will fail to collapse. And if I miss >>> something, please let me know. >>> >> >> I tried to have MADV_COLLAPSE follow the same THP eligibility >> semantics as khugepaged and at-fault: either THP=always, or >> THP=madvise and the vma is marked with MADV_HUGEPAGE, as you point >> out. >> >> If I understand you correctly, the usefulness of >> process_madvise(MADV_COLLAPSE) is limited in the case where >> THP=madvise and a CAP_SYS_ADMIN user is requesting a collapse of >> behalf of another process since they don't have a way to mark the >> target memory as eligible (which requires VM_HUGEPAGE). >> >> If so, I think that's a valid point, and your suggestion below of a >> supporting MADV_[NO]HUGEPAGE for process_madvise(2) makes sense. For >> the sake of exploring all options, I'll mention that there was also a >> previous idea suggested by Yang Shi where MADV_COLLAPSE could also set >> VM_HUGEPAGE[1]. >> > > If a user is doing MADV_COLLAPSE on behalf of itself, it seems unnecessary > to need to do MADV_HUGEPAGE before that regardless of system-wide settings > that it may not control? > > Same point for a root user doing this on behalf of another user. It could > either do MADV_HUGEPAGE and change the behavior that the user perhaps > requested behind its back (not desired) or it could temporarily set the > system-wide setting to allow THP always before doing the MADV_COLLAPSE > (it's root). > > We can simply allow MADV_COLLAPSE to always collapse in either context > regardless of VM_HUGEPAGE, VM_NOHUGEPAGE, or system-wide settings? > >> Since it's possible supporting MADV_[NO]HUGEPAGE for >> process_madivse(2) has applications outside a subsequent >> MADV_COLLAPSE, and since I don't see process_madvise(MADV_COLLAPSE) to >> be in a hot path, I'd vote in favor of your suggestion and include >> process_madvise(MADV_[NO]HUGEPAGE) support in v6 unless others object. >> >> Thanks again for your review and your suggestion! >> Zach >> >> [1] https://lore.kernel.org/linux-mm/CAHbLzkqLRBd6u3qn=KqpOhRcPZtpGXbTXLUjK1z=4d_dQ06Pvw@mail.gmail.com/ >> >>> If so, how about introducing process_madvise(MADV_HUGEPAGE) or >>> process_madvise(MADV_NOHUGEPAGE)? The former helps to mark the target >>> vma with 'hg', and the collapse process can be finished completely with >>> the help of other processes. the latter could let some special vma avoid >>> collapsing when setting 'THP=always'. >>> >>> Best regards, >>> -wrw >>> >>> On 5/5/22 5:44 AM, Zach O'Keefe wrote: >>>> Allow MADV_COLLAPSE behavior for process_madvise(2) if caller has >>>> CAP_SYS_ADMIN or is requesting collapse of it's own memory. >>>> >>>> Signed-off-by: Zach O'Keefe >>>> --- >>>> mm/madvise.c | 6 ++++-- >>>> 1 file changed, 4 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/mm/madvise.c b/mm/madvise.c >>>> index 638517952bd2..08c11217025a 100644 >>>> --- a/mm/madvise.c >>>> +++ b/mm/madvise.c >>>> @@ -1168,13 +1168,15 @@ madvise_behavior_valid(int behavior) >>>> } >>>> >>>> static bool >>>> -process_madvise_behavior_valid(int behavior) >>>> +process_madvise_behavior_valid(int behavior, struct task_struct *task) >>>> { >>>> switch (behavior) { >>>> case MADV_COLD: >>>> case MADV_PAGEOUT: >>>> case MADV_WILLNEED: >>>> return true; >>>> + case MADV_COLLAPSE: >>>> + return task == current || capable(CAP_SYS_ADMIN); >>>> default: >>>> return false; >>>> } >>>> @@ -1452,7 +1454,7 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec, >>>> goto free_iov; >>>> } >>>> >>>> - if (!process_madvise_behavior_valid(behavior)) { >>>> + if (!process_madvise_behavior_valid(behavior, task)) { >>>> ret = -EINVAL; >>>> goto release_task; >>>> } >>