Subject: Re: [RFC PATCH 00/16] 1GB THP support on x86_64
From: John Hubbard
To: Roman Gushchin, Zi Yan
CC: Michal Hocko, Rik van Riel, "Kirill A. Shutemov", Matthew Wilcox,
    Shakeel Butt, Yang Shi, David Nellans
Date: Tue, 8 Sep 2020 21:01:07 -0700
In-Reply-To: <20200908195859.GC567128@carbon.DHCP.thefacebook.com>
References: <20200902180628.4052244-1-zi.yan@sent.com>
    <20200903073254.GP4617@dhcp22.suse.cz>
    <20200903162527.GF60440@carbon.dhcp.thefacebook.com>
    <20200904074207.GC15277@dhcp22.suse.cz>
    <20200904211045.GA567128@carbon.DHCP.thefacebook.com>
    <20200907072014.GD30144@dhcp22.suse.cz>
    <3CDAD67E-23A1-4D84-BF19-FFE1CF956779@nvidia.com>
    <20200908195859.GC567128@carbon.DHCP.thefacebook.com>

On 9/8/20 12:58 PM, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
>> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
>>> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>>>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
>>> [...]
>> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
>> better and clearer control of getting huge pages from the kernel and
>> know when they will pay the cost of getting the huge pages.
>>
>> I would think the suggestion is more about the huge page control options
>> currently provided by the kernel do not have predictable performance
>> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
>> users whether the marked virtual address range is backed by huge pages
>> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
>> deterministic result to users on whether the huge page(s) are formed
>> or not.
>
> Yeah, I agree with Michal here, we need a more straightforward interface.
>
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

I agree, especially because this is starting to look a lot more like an
allocation call. And I think it would be appropriate for the kernel to
try approximately as hard to provide these 1GB pages as it would to
allocate normal memory to a process.

In fact, for a moment I thought, why not go all the way and make this
actually be a true allocation? However, given that we still have
operations that require page splitting, with no good way to call back
user space to notify it that its "allocated" huge pages are being split,
that fails. But it's still pretty close.

>
> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.
>
> Idk what is better long-term, but new madvise flags are probably slightly
> easier to deal with in the development process.
>

Probably either an MADV_* flag or a new syscall would work fine. But
given that this seems like a pretty distinct new capability, one with
options and man page documentation and possibly future flags itself, I'd
lean toward making it its own new syscall, maybe:

    compact_huge_pages(nbytes or npages, flags /* page size, etc */);

...thus leaving madvise() and its remaining flags still available, to
further refine things.

thanks,
-- 
John Hubbard
NVIDIA
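
For illustration only: a minimal sketch of how a page size could be
carried in the flags argument of the compact_huge_pages() call floated
above. It borrows the scheme mmap() already uses for MAP_HUGE_2MB and
MAP_HUGE_1GB (log2 of the page size stored in six bits starting at bit
26). Every CHP_* name and both helper functions are hypothetical; nothing
here exists in the kernel, and the thread does not settle on any layout.

```c
#include <stdint.h>

/*
 * Hypothetical flag layout, modeled on MAP_HUGE_SHIFT / MAP_HUGE_MASK.
 * These names are assumptions made for this sketch only.
 */
#define CHP_SIZE_SHIFT 26u
#define CHP_SIZE_MASK  0x3fu
#define CHP_SIZE_2MB   (21u << CHP_SIZE_SHIFT)  /* log2(2MB) = 21 */
#define CHP_SIZE_1GB   (30u << CHP_SIZE_SHIFT)  /* log2(1GB) = 30 */
#define CHP_FLAG_SYNC  0x1u  /* hypothetical: only return once pages are formed or have failed */

/* Decode the requested page size in bytes; 0 means no size was encoded. */
static uint64_t chp_page_size(unsigned int flags)
{
	unsigned int log2sz = (flags >> CHP_SIZE_SHIFT) & CHP_SIZE_MASK;

	return log2sz ? (UINT64_C(1) << log2sz) : 0;
}

/* Number of whole huge pages that fit in nbytes for the encoded size. */
static uint64_t chp_npages(uint64_t nbytes, unsigned int flags)
{
	uint64_t size = chp_page_size(flags);

	return size ? nbytes / size : 0;
}
```

With this layout a caller would pass something like
CHP_SIZE_1GB | CHP_FLAG_SYNC, and the low bits stay free for future
options, which is the same argument made above for keeping madvise()
flags available.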