Subject: Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
To: Liang Li
Cc: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
 Dan Williams, "Michael S. Tsirkin", Jason Wang, Dave Hansen,
 Michal Hocko, Liang Li, linux-mm, LKML,
 virtualization@lists.linux-foundation.org
References: <96BB0656-F234-4634-853E-E2A747B6ECDB@redhat.com>
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Tue, 5 Jan 2021 10:39:26 +0100

On 05.01.21 03:14, Liang Li wrote:
>>>>> In our production environment, there are three main applications that
>>>>> have such a requirement. One is QEMU [creating a VM with an SR-IOV
>>>>> passthrough device]; the other two are DPDK related applications,
>>>>> DPDK OVS and SPDK vhost. For best performance, they populate memory
>>>>> when starting up. For SPDK vhost, we make use of the
>>>>> VHOST_USER_GET/SET_INFLIGHT_FD feature for vhost 'live' upgrade,
>>>>> which is done by killing the old process and starting a new one with
>>>>> the new binary. In this case, we want the new process started as
>>>>> quickly as possible to shorten the service downtime. We really enable
>>>>> this feature to speed up startup time for them :)
>>
>> Am I wrong or does using hugetlbfs/tmpfs ... i.e., a file not-deleted
>> between shutting down the old instances and firing up the new instance
>> just solve this issue?
>
> You are right, it works for the SPDK vhost upgrade case.
>
>>
>>>>
>>>> Thanks for info on the use case!
>>>>
>>>> All of these use cases either already use, or could use, huge pages
>>>> IMHO. It's not your ordinary proprietary gaming app :) This is where
>>>> pre-zeroing of huge pages could already help.
>>>
>>> You are welcome.
>>> For some historical reason, some of our services are
>>> not using hugetlbfs, that is why I didn't start with hugetlbfs.
>>>
>>>> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
>>>> creating a file and pre-zeroing it from another process, or am I missing
>>>> something important? At least for QEMU this should work AFAIK, where you
>>>> can just pass the file to be used via memory-backend-file.
>>>>
>>> If using another process to create a file, we can offload the overhead to
>>> another process, and there is no need to pre-zero its content; just
>>> populating the memory is enough.
>>
>> Right, if non-zero memory can be tolerated (e.g., for VMs it usually has
>> to be).
>
> I mean there is no need to explicitly pre-zero the file content in user
> space; the kernel will do it when populating the memory.
>
>>> If we do it that way, then how do we determine the size of the file? It
>>> depends on the RAM size of the VM the customer buys.
>>> Maybe we can create a file
>>> large enough in advance and truncate it to the right size just before the
>>> VM is created. Then, how many large files should be created on a host?
>>
>> That's mostly already existing scheduling logic, no? (How many VMs can I
>> put onto a specific machine eventually)
>
> It depends on how the scheduling component is designed. Yes, you can put
> 10 VMs with 4C8G (4 CPUs, 8G RAM) on a host and 20 VMs with 2C4G on
> another one. But if one type of them, e.g. 4C8G, is sold out, customers
> can't buy more 4C8G VMs while there are some free 2C4G VMs; the resources
> reserved for them could be provided as 4C8G VMs.
>

1. You can, just the startup time will be a little slower? E.g., grow a
pre-allocated 4G file to 8G.

2. Or let's be creative: teach QEMU to construct a single
RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
don't go crazy on different VM sizes / size differences.

3. In your example above, you can dynamically rebalance as VMs are
getting sold, to make sure you always have "big ones" lying around that
you can shrink on demand.

>
> You must know there are a lot of functions in the kernel which can
> be done in userspace, e.g. some of the device emulations like APIC, or
> the vhost-net backend, which has a userspace implementation. :)
> Bad or not depends on the benefits the solution brings.
> From the viewpoint of a user space application, the kernel should
> provide a high performance memory management service. That's why
> I think it should be done in the kernel.

As I expressed a couple of times already, I don't see why using
hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.

We really don't *want* complicated things deep down in the mm core if
there are reasonable alternatives.

-- 
Thanks,

David / dhildenb
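
As a rough illustration of option 1 above (growing a pre-allocated file before it is handed to a VM), a helper process could do something like the following. This is only a minimal sketch under stated assumptions, not code from this thread: `prepare_backing_file` and `populate` are hypothetical names, the path is assumed to live on tmpfs, and sizes are assumed to be multiples of the page size. hugetlbfs would additionally need sizes aligned to the huge page size.

```python
import mmap
import os


def populate(fd: int, offset: int, length: int) -> None:
    """Touch one byte per page so the kernel allocates (and zeroes) the pages."""
    with mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                   prot=mmap.PROT_READ | mmap.PROT_WRITE,
                   offset=offset) as m:
        for pos in range(0, length, mmap.PAGESIZE):
            # Write back the byte just read: this triggers a write fault
            # without changing any existing file content.
            m[pos] = m[pos]


def prepare_backing_file(path: str, size: int) -> None:
    """Create the file if needed, grow it to `size`, populate only the new part.

    Hypothetical helper: it never shrinks an existing file.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        old_size = os.fstat(fd).st_size
        if size > old_size:
            os.ftruncate(fd, size)
            # mmap offsets must be aligned, so round the start down.
            start = old_size - (old_size % mmap.ALLOCATIONGRANULARITY)
            populate(fd, start, size - start)
    finally:
        os.close(fd)
```

Calling `prepare_backing_file("/dev/shm/vm0.mem", 4 << 30)` ahead of time and `prepare_backing_file("/dev/shm/vm0.mem", 8 << 30)` just before VM creation would then only pay the population cost for the second 4G (the path is just an example). The file could afterwards be passed to QEMU via memory-backend-file, as mentioned earlier in the thread.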