Date: Tue, 22 Mar 2022 13:11:34 +0100
From: Michal Hocko
To: Zach O'Keefe
Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
	SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm@kvack.org,
	Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
	Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
	Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
	"Kirill A. Shutemov", Matthew Wilcox, Matt Turner, Max Filippov,
	Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
	Thomas Bogendoerfer, Yang Shi
Subject: Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
References: <20220308213417.1407042-1-zokeefe@google.com>

On Mon 21-03-22 08:46:35, Zach O'Keefe wrote:
> Hey Michal, thanks for taking the time to review / comment.
>
> On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko wrote:
> >
> > [ Removed Richard Henderson from the CC list as the delivery fails for
> >   his address]
>
> Thank you :)
>
> > On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > > Introduction
> > > --------------------------------
> > >
> > > This series provides a mechanism for userspace to induce a collapse of
> > > eligible ranges of memory into transparent hugepages in process context,
> > > thus permitting users to more tightly control their own hugepage
> > > utilization policy at their own expense.
> > >
> > > This idea was previously introduced by David Rientjes, and thanks to
> > > everyone for your patience while I prepared these patches resulting from
> > > that discussion[1].
> > >
> > > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> > >
> > > Interface
> > > --------------------------------
> > >
> > > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > > leverages the new process_madvise(2) call.
> > >
> > > (*) process_madvise(2)
> > >
> > >         Performs a synchronous collapse of the native pages mapped by
> > >         the list of iovecs into transparent hugepages. The default gfp
> > >         flags used will be the same as those used at-fault for the VMA
> > >         region(s) covered.
> >
> > Could you expand on reasoning here? The default allocation mode for #PF
> > is rather light. Madvised will try harder. The reasoning is that we want
> > to make stalls due to #PF as small as possible and only try harder for
> > madvised areas (also a subject of configuration). Wouldn't it make more
> > sense to try harder for an explicit calls like madvise?
> >
>
> The reasoning is that the user has presumably configured system/vmas
> to tell the kernel how badly they want thps, and so this call aligns
> with current expectations. I.e.
> a user who goes about the trouble of
> trying to fault-in a thp at a given memory address likely wants a thp
> "as bad" as the same user MADV_COLLAPSE'ing the same memory to get a
> thp.

If the syscall tries only as hard as the #PF, doesn't that limit the
functionality? I mean a non-#PF path can consume more resources to
allocate and collapse a THP as it won't inflict any measurable latency
on the targeted process (except for potential CPU contention). From that
perspective madvise is much more similar to khugepaged. I would even
argue that it could try even harder because madvise is focused on a very
specific memory range and the execution is not shared among all
processes that are scanned by khugepaged.

> If this is not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be
> used to explicitly request the kernel to try harder, as you mention.

Do we really need that? How many do_harder levels do we want to support?
What would be typical use cases for the #PF-based and DEFRAG usages?

[...]

> > >         Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > >         by default, as the user is explicitly requesting this action.
> > >         Define two flags to control collapse semantics, passed through
> > >         process_madvise(2)’s optional flags parameter:
> >
> > This part is discussed later in the thread.
> >
> > >
> > >         MADV_F_COLLAPSE_LIMITS
> > >
> > >         If supplied, collapse respects pte collapse limits set via
> > >         sysfs:
> > >         /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > >         Required if calling on behalf of another process and not
> > >         CAP_SYS_ADMIN.
> > >
> > >         MADV_F_COLLAPSE_DEFRAG
> > >
> > >         If supplied, permit synchronous compaction and reclaim,
> > >         regardless of VMA flags.
> >
> > Why do we need this?
>
> Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?
>
> * MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
> inter-process protection for collapsing memory in another process'
> address space (which a malevolent program could exploit to cause oom
> conditions in another memcg hierarchy, for example), but we want
> privileged (CAP_SYS_ADMIN) users to otherwise be able to optimize thp
> utilization as they wish.

Could you expand some more please? How is this any different from
khugepaged (well, except that you can trigger the collapsing explicitly
rather than rely on khugepaged to find that mm)?

> * MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want
> to explicitly tell the kernel to try harder to back this by thps,
> regardless of the current system/vma configuration.
>
> Note that when used together, these flags can be used to implement the
> exact behavior of khugepaged, through MADV_COLLAPSE.

IMHO this is stretching the interface and this can backfire in the
future. The interface should be really trivial: I want to collapse a
memory area. Let the kernel do the right thing and do not bother the
caller with all the implementation details. I would use the same
allocation strategy as khugepaged, as this seems to be the closest from
the latency and application awareness POV. In a way you can look at the
madvise call as a way to trigger khugepaged functionality on the
particular memory range.

-- 
Michal Hocko
SUSE Labs
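
For illustration only, the "really trivial" interface argued for above
could look roughly like this from userspace: a plain madvise(2) call on
a range, with no extra flags. This is a minimal sketch, not code from
the series. It assumes the advice value 25 that MADV_COLLAPSE later
received in mainline (the RFC under discussion may have used a different
number), and it assumes 2 MiB PMD-sized hugepages. The helper names
hpage_align_up() and collapse_one_hugepage() are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

/* Assumption: value later assigned upstream; the RFC may differ. */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/* Assumption: 2 MiB PMD-sized hugepages (the x86-64 default). */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Hypothetical helper: round addr up to the next hugepage boundary. */
uintptr_t hpage_align_up(uintptr_t addr)
{
	return (addr + HPAGE_SIZE - 1) & ~(uintptr_t)(HPAGE_SIZE - 1);
}

/*
 * Hypothetical helper: map an anonymous region, touch one
 * hugepage-aligned chunk so it is populated, and ask the kernel to
 * collapse it. Returns 0 on success or a negative errno; kernels
 * without MADV_COLLAPSE are expected to fail with -EINVAL.
 */
int collapse_one_hugepage(void)
{
	size_t len = 2 * HPAGE_SIZE;	/* over-allocate to find an aligned start */
	char *raw, *aligned;
	int ret = 0;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -errno;

	aligned = (char *)hpage_align_up((uintptr_t)raw);
	assert(((uintptr_t)aligned & (HPAGE_SIZE - 1)) == 0);

	memset(aligned, 1, HPAGE_SIZE);	/* populate the range */

	/* No flags, no tuning: the kernel picks the allocation strategy. */
	if (madvise(aligned, HPAGE_SIZE, MADV_COLLAPSE) != 0)
		ret = -errno;

	munmap(raw, len);
	return ret;
}
```

A caller would treat any negative return as "no hugepage this time" and
carry on, which is the point of keeping the allocation policy inside
the kernel rather than in the interface.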