From: David Hildenbrand
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
 Arnd Bergmann, Michal Hocko, Oscar Salvador, Matthew Wilcox,
 Andrea Arcangeli, Minchan Kim, Jann Horn, Jason Gunthorpe, Dave Hansen,
 Hugh Dickins, Rik van Riel, "Michael S . Tsirkin", "Kirill A . Shutemov",
 Vlastimil Babka, Richard Henderson, Ivan Kokshaysky, Matt Turner,
 Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller, Chris Zankel,
 Max Filippov, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
 linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
 linux-arch@vger.kernel.org
References: <20210217154844.12392-1-david@redhat.com>
 <20210218225904.GB6669@xz-x1> <20210219163157.GF6669@xz-x1>
 <41444eb8-8bb8-8d5b-4cec-be7fa7530d0e@redhat.com>
Organization: Red Hat GmbH
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Message-ID: <4d8e6f55-66a6-d701-6a94-79f5e2b23e46@redhat.com>
Date: Fri, 19 Feb 2021 20:14:45 +0100
In-Reply-To: <41444eb8-8bb8-8d5b-4cec-be7fa7530d0e@redhat.com>

>> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce large
>> guest start-up and migration time.", 2017-03-14). It seems for speeding up VM
>> boot, but what I can't understand is why it would cause the delay of hugetlb
>> accounting - I thought we'd fail even earlier at either fallocate() on the
>> hugetlb file (when we use /dev/hugepages) or on mmap() of the memfd which
>> contains the huge pages. See hugetlb_reserve_pages() and its callers. Or did
>> I miss something?
>
> We should fail on mmap() when the reservation happens (unless
> MAP_NORESERVE is passed) I think.
>
>>
>> I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs
>> mapping, that could cause the memory accounting to be delayed until COW happens.
>
> That would be kind of weird. I'd assume the reservation gets properly
> done during fork() - just like for VM_ACCOUNT.
>
>> However that's definitely not the case for QEMU since QEMU won't work at all as
>> late as that point.
>>
>> IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
>> simply want to know "whether we do still have enough space".. And IIUC 2)
>> above is the major issue you'd like to solve too.
>
> To avoid page faults at runtime on access I think. Reservation <=
> Preallocation.

I just learned that there is more to it: (test done on v5.9)

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# cat /sys/devices/system/node/node*/meminfo | grep HugePages_
Node 0 HugePages_Total:   512
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
# cat /proc/meminfo | grep HugePages_
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:        0
HugePages_Surp:        0

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> works just fine

# /usr/libexec/qemu-kvm -m 1G -smp 1 -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 -numa node,nodeid=0,memdev=mem0 -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic
-> Does not fail nicely but crashes!

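To make the numbers in the transcript above concrete, here is a minimal
sketch in Python. It is not part of the original test. It compares a
global "enough free huge pages?" check with a per-node check for the
node the guest memory is bound to. The helper names
(free_hugepages_per_node, pages_needed, check) are made up for
illustration, and treating the reservation as a purely global check is
an assumption of the sketch, not something this code verifies against
the kernel.

```python
import re

# Per-node counters copied from the transcript above
# (/sys/devices/system/node/node*/meminfo, filtered for HugePages_).
NODE_MEMINFO = """\
Node 0 HugePages_Total:   512
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
"""


def free_hugepages_per_node(text):
    """Return {node_id: free_pages} parsed from per-node meminfo text."""
    free = {}
    for match in re.finditer(r"Node\s+(\d+)\s+HugePages_Free:\s+(\d+)", text):
        free[int(match.group(1))] = int(match.group(2))
    return free


def pages_needed(guest_bytes, hugepage_bytes):
    """Number of huge pages needed to back guest_bytes (rounded up)."""
    return -(-guest_bytes // hugepage_bytes)


def check(text, guest_bytes, hugepage_bytes, bind_node):
    """Compare a global availability check with a per-node one.

    Assumption (hypothetical model): "global_ok" stands in for a
    reservation that only looks at the overall pool, while "node_ok"
    is what matters once the memory is bound to bind_node.
    """
    free = free_hugepages_per_node(text)
    need = pages_needed(guest_bytes, hugepage_bytes)
    return {
        "need": need,
        "global_ok": sum(free.values()) >= need,
        "node_ok": free.get(bind_node, 0) >= need,
    }


if __name__ == "__main__":
    GiB, MiB = 1 << 30, 1 << 20
    for node in (0, 1):
        print("host-nodes=%d:" % node, check(NODE_MEMINFO, GiB, 2 * MiB, node))
```

A 1G guest backed by 2M pages needs 512 pages. With the counters above,
the global check passes for both runs, but the per-node check only
passes for host-nodes=0, which matches the run that worked.
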
See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something
similar, however, it no longer applies like that on more recent kernels.

Hugetlbfs reservations don't always protect you (especially with NUMA) -
that's why e.g., libvirt always tells QEMU to prealloc.

I think the "issue" is that the reservation happens on mmap(). mbind()
runs afterwards. Preallocation saves you from that.

I suspect something similar will happen with anonymous memory with
mbind() even if we reserved swap space. Did not test yet, though.

-- 
Thanks,

David / dhildenb