Date: Thu, 18 Feb 2021 11:01:30 +0100
From: Michal Hocko
To: Song Liu
Cc: David Rientjes, Alex Shi, Hugh Dickins, Andrea Arcangeli,
	"Kirill A. Shutemov", Matthew Wilcox, Minchan Kim, Vlastimil Babka,
	Chris Kennelly, Linux MM, Linux API
Subject: Re: [RFC] Hugepage collapse in process context
References: <9B5BFA9A-E945-4665-B335-A0B8E36D4463@fb.com> <97A31D94-671B-4400-8114-9039B28E54A7@fb.com>
In-Reply-To: <97A31D94-671B-4400-8114-9039B28E54A7@fb.com>

On Thu 18-02-21 09:53:25, Song Liu wrote:
>
>
> > On Feb 18, 2021, at 12:39 AM, Michal Hocko wrote:
> >
> > On Thu 18-02-21 08:11:13, Song Liu wrote:
> >>
> >>
> >>> On Feb 16, 2021, at 8:24 PM, David Rientjes wrote:
> >>>
> >>> Hi everybody,
> >>>
> >>> Khugepaged is slow by default, it scans at most 4096 pages every 10s.
> >>> That's normally fine as a system-wide setting, but some applications would
> >>> benefit from a more aggressive approach (as long as they are willing to
> >>> pay for it).
> >>>
> >>> Instead of adding priorities for eligible ranges of memory to khugepaged,
> >>> temporarily speeding khugepaged up for the whole system, or sharding its
> >>> work for memory belonging to a certain process, one approach would be to
> >>> allow userspace to induce hugepage collapse.
> >>>
> >>> The benefit to this approach would be that this is done in process context
> >>> so its cpu is charged to the process that is inducing the collapse.
> >>> Khugepaged is not involved.
> >>>
> >>> Idea was to allow userspace to induce hugepage collapse through the new
> >>> process_madvise() call. This allows us to collapse hugepages on behalf of
> >>> current or another process for a vectored set of ranges.
> >>>
> >>> This could be done through a new process_madvise() mode *or* it could be a
> >>> flag to MADV_HUGEPAGE since process_madvise() allows for a flag parameter
> >>> to be passed. For example, MADV_F_SYNC.
> >>>
> >>> When done, this madvise call would allocate a hugepage on the right node
> >>> and attempt to do the collapse in process context just as khugepaged would
> >>> otherwise do.
> >>
> >> This is very interesting idea. One question, IIUC, the user process will
> >> block until all small pages in given ranges are collapsed into THPs.
> >
> > Do you mean that PF would be blocked due to exclusive mmap_sem? Or is
> > there anything else you have in mind?
>
> I was thinking about memory defragmentation when the application asks for
> many THPs. Say the application looks like
>
> main()
> {
>         malloc();
>         madvise(HUGE);
>         process_madvise();
>
>         /* start doing work */
> }
>
> IIUC, when process_madvise() finishes, the THPs should be ready. However,
> if defragmentation takes a long time, the process will wait in process_madvise().

OK, I see.
The operation is definitely not free, which is to be expected. You can do
the same from a thread which can spend time collapsing THPs. There are
still internal resources that might block others - e.g. the above
mentioned mmap_sem. We can try hard to reduce the lock time but this is
unlikely to be completely free of any interruption of the workload.
-- 
Michal Hocko
SUSE Labs
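
For illustration only, the "do the same from a thread" suggestion might look roughly like the sketch below. It is not code from this thread: struct collapse_job, alloc_hpage_aligned(), collapse_worker() and start_collapse() are hypothetical names, the 2 MiB huge page size is an assumption, and MADV_HUGEPAGE is used purely as a stand-in because it only marks the range as eligible. The synchronous collapse mode discussed above is a proposal, so the advice value is left as a parameter.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Assumption: 2 MiB huge pages, the common x86-64 THP size. */
#define HPAGE_SIZE ((size_t)2 << 20)

/* One range handed to the helper thread (hypothetical struct). */
struct collapse_job {
	void *addr;	/* huge-page-aligned start of the range */
	size_t len;	/* length in bytes, a multiple of HPAGE_SIZE */
	int advice;	/* MADV_HUGEPAGE here, only as a stand-in */
	int err;	/* 0 on success, errno from madvise() otherwise */
};

/* Allocate len bytes aligned to HPAGE_SIZE; returns NULL on failure. */
static void *alloc_hpage_aligned(size_t len)
{
	void *p = NULL;

	assert(len % HPAGE_SIZE == 0);
	return posix_memalign(&p, HPAGE_SIZE, len) == 0 ? p : NULL;
}

/* Thread body: the possibly slow madvise() call runs here, not in main. */
static void *collapse_worker(void *arg)
{
	struct collapse_job *job = arg;

	job->err = madvise(job->addr, job->len, job->advice) ? errno : 0;
	return NULL;
}

/* Start the helper; returns 0 or a pthread_create() error number. */
static int start_collapse(pthread_t *tid, struct collapse_job *job)
{
	return pthread_create(tid, NULL, collapse_worker, job);
}
```

The main thread would call start_collapse(), carry on with its work, and pthread_join() the helper whenever it wants the result in job.err. This only moves the waiting out of the main thread. Page faults in the main thread can still contend with the helper on mmap_sem, as noted above.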