Date: Thu, 18 Feb 2021 14:34:56 -0800 (PST)
From: David Rientjes
To: David Hildenbrand
cc: Vlastimil Babka, Michal
Hocko, Alex Shi, Hugh Dickins, Andrea Arcangeli, "Kirill A. Shutemov", Song Liu, Matthew Wilcox, Minchan Kim, Chris Kennelly, linux-mm@kvack.org, linux-api@vger.kernel.org
Subject: Re: [RFC] Hugepage collapse in process context
In-Reply-To: <600ee57f-d839-d402-fb0f-e9f350114dce@redhat.com>
Message-ID: <5127b9c-a147-8ef5-c942-ae8c755413d0@google.com>
References: <0b51a213-650e-7801-b6ed-9545466c15db@suse.cz> <600ee57f-d839-d402-fb0f-e9f350114dce@redhat.com>

On Thu, 18 Feb 2021, David Hildenbrand wrote:

> > > > Hi everybody,
> > > >
> > > > Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> > > > That's normally fine as a system-wide setting, but some applications
> > > > would benefit from a more aggressive approach (as long as they are
> > > > willing to pay for it).
> > > >
> > > > Instead of adding priorities for eligible ranges of memory to
> > > > khugepaged, temporarily speeding khugepaged up for the whole system,
> > > > or sharding its work for memory belonging to a certain process, one
> > > > approach would be to allow userspace to induce hugepage collapse.
> > > >
> > > > The benefit to this approach would be that this is done in process
> > > > context so its cpu is charged to the process that is inducing the
> > > > collapse. Khugepaged is not involved.
> > > >
> > > Yes, this makes a lot of sense to me.
> > >
> > > > Idea was to allow userspace to induce hugepage collapse through the
> > > > new process_madvise() call. This allows us to collapse hugepages on
> > > > behalf of current or another process for a vectored set of ranges.
> > >
> > > Yes, madvise sounds like a good fit for the purpose.
> >
> > Agreed on both points.
> >
> > > > This could be done through a new process_madvise() mode *or* it
> > > > could be a flag to MADV_HUGEPAGE since process_madvise() allows for
> > > > a flag parameter to be passed. For example, MADV_F_SYNC.
> > >
> > > Would this MADV_F_SYNC be applicable to other madvise modes? Most
> > > existing madvise modes do not seem to make much sense. We can argue that
> > > MADV_PAGEOUT would guarantee the range was indeed reclaimed but I am not
> > > sure we want to provide such a strong semantic because it can limit
> > > future reclaim optimizations.
> > >
> > > To me MADV_HUGEPAGE_COLLAPSE sounds like the easiest way forward.
> >
> > I guess in the old madvise(2) we could create a new combo of
> > MADV_HUGEPAGE | MADV_WILLNEED with this semantic? But you are probably
> > more interested in process_madvise() anyway. There the new flag would
> > make more sense. But there's also David H.'s proposal for MADV_POPULATE
> > and there might be benefit in considering both at the same time? Should
> > e.g. MADV_POPULATE with MADV_HUGEPAGE have the collapse semantics? But
> > would MADV_POPULATE be added to process_madvise() as well? Just thinking
> > out loud so we don't end up with more flags than necessary, it's already
> > confusing enough as it is.
> >
>
> Note that madvise() eats only a single value, not flags. Combinations as you
> describe are not possible.
>
> Something MADV_HUGEPAGE_COLLAPSE make sense to me that does not need the mmap
> lock in write and does not modify the actual VMA, only a mapping.
>

Agreed, and happy to see that there's a general consensus for the direction.
Benefit of a new madvise mode is that it can be used for madvise() as well
if you are interested in only a single range of your own memory and then
it doesn't need to reconcile with any of the already overloaded semantics
of MADV_HUGEPAGE. Otherwise, process_madvise() can be used for other
processes and/or vectored ranges.

Song's use case for this to prioritize thp usage is very important for us
as well. I hadn't thought of the madvise(MADV_HUGEPAGE) +
madvise(MADV_HUGEPAGE_COLLAPSE) use case: I was anticipating the latter
would allocate the hugepage with khugepaged's gfp mask so it would always
compact. But it seems like this would actually be better to use the gfp
mask that would be used at fault for the vma and left to userspace to
determine whether that's MADV_HUGEPAGE or not. Makes sense.

(Userspace could even do madvise(MADV_NOHUGEPAGE) +
madvise(MADV_HUGEPAGE_COLLAPSE) to do the synchronous collapse but
otherwise exclude it from khugepaged's consideration if it were inclined.)

Two other minor points:

 - Currently, process_madvise() doesn't use the flags parameter at all so
   there's the question of whether we need generalized flags that apply to
   most madvise modes or whether the flags can be specific to the mode
   being used. For example, a natural extension of this new mode would be
   to determine the hugepage size if we were ever to support synchronous
   collapse into a 1GB gigantic page on x86 (MADV_F_1GB? :)

 - We haven't discussed the future of khugepaged with this new mode: it
   seems like we could simply implement khugepaged fully in userspace and
   remove it from the kernel? :)
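
For a sense of scale on why a synchronous collapse is attractive, the
default scan rate quoted at the top of the thread (at most 4096 pages every
10s) can be turned into a rough time estimate. This is only a
back-of-the-envelope sketch: the 4 KiB base page size is an assumption
(x86-64), the function name is made up for illustration, and the model
ignores the time khugepaged spends actually scanning and collapsing.

```python
# Rough estimate only. The scan rate comes from the thread ("at most 4096
# pages every 10s"); the 4 KiB base page size is an assumption (x86-64).
PAGE_SIZE = 4096
PAGES_PER_PASS = 4096
SLEEP_BETWEEN_PASSES_S = 10


def khugepaged_scan_seconds(nbytes: int) -> int:
    """Approximate seconds for khugepaged to scan nbytes at the default rate.

    Models one pass of PAGES_PER_PASS pages per SLEEP_BETWEEN_PASSES_S
    seconds and ignores the time spent scanning and collapsing.
    """
    pages = -(-nbytes // PAGE_SIZE)        # ceiling division
    passes = -(-pages // PAGES_PER_PASS)   # ceiling division
    return passes * SLEEP_BETWEEN_PASSES_S


if __name__ == "__main__":
    for gib in (1, 16, 64):
        secs = khugepaged_scan_seconds(gib << 30)
        print(f"{gib:3d} GiB: ~{secs} s (~{secs / 3600:.1f} h)")
```

Under these assumptions, 1 GiB of eligible memory takes on the order of
640 s to be scanned at the default rate, which is the latency an
application-driven collapse in process context would avoid.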