To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
 Arnd Bergmann, Michal Hocko, Oscar Salvador, Matthew Wilcox,
 Andrea Arcangeli, Minchan Kim, Jann Horn, Jason Gunthorpe, Dave Hansen,
 Hugh Dickins, Rik van Riel, "Michael S . Tsirkin", "Kirill A . Shutemov",
 Vlastimil Babka, Richard Henderson, Ivan Kokshaysky, Matt Turner,
 Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller, Chris Zankel,
 Max Filippov, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
 linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
 linux-arch@vger.kernel.org
References: <20210217154844.12392-1-david@redhat.com>
 <20210218225904.GB6669@xz-x1>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Date: Fri, 19 Feb 2021 09:20:16 +0100
In-Reply-To: <20210218225904.GB6669@xz-x1>

On 18.02.21 23:59, Peter Xu wrote:
> Hi, David,
>
> On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>> discard memory inside such a sparse memory region. Example users are
>> hypervisors (especially implementing memory ballooning or similar
>> technologies like virtio-mem) and memory allocators. In addition, we want
>> to fail in a nice way if populating does not succeed because we are out of
>> backend memory (which can happen easily with file-based mappings,
>> especially tmpfs and hugetlbfs).
>
> Could you explain a bit more on how do you plan to use this new interface for
> the virtio-balloon scenario?

Sure, that will bring up an interesting point to discuss
(MADV_POPULATE_WRITE).

I'm planning on using it in virtio-mem: whenever the guest requests the
hypervisor (via a virtio-mem device) to make specific blocks available
("plug"), I want to have a configurable option ("populate=on" /
"prealloc=on") to perform safety checks ("prealloc") and populate page
tables.

This becomes especially relevant for private/shared hugetlbfs and shared
files/shmem where we have a limited pool size (e.g., huge pages, tmpfs
size, filesystem size). But it will also come in handy when just
preallocating (esp. zeroing) anonymous memory.

For virtio-balloon it is not applicable because it really only supports
anonymous memory and we cannot fail requests to deflate ...

--- Example ---

Assume the guest requests to make 128 MB available and we're using
hugetlbfs. Assume we're out of huge pages in the hypervisor. We want to
fail the request, so I need some kind of preallocation.

So I could do fallocate() on anything that's MAP_SHARED, but not on
anything that's MAP_PRIVATE. hugetlbfs via memfd_create() cannot be
preallocated without going via SIGBUS handlers.

--- QEMU memory configurations ---

I see the following combinations relevant in QEMU that I want to support
with virtio-mem:

1) MAP_PRIVATE anonymous memory
2) MAP_PRIVATE on hugetlbfs (esp. via memfd)
3) MAP_SHARED on hugetlbfs (esp. via memfd)
4) MAP_SHARED on shmem (file / memfd)
5) MAP_SHARED on some sparse file.

Other MAP_PRIVATE mappings barely make any sense to me - "read the file
and write to page cache" is not really applicable to VM RAM (not to
mention doing fallocate(PUNCH_HOLE) that invalidates the private copies
of all other mappings on that file).

--- Ways to populate/preallocate ---

I see the following ways to populate/preallocate:

a) MADV_POPULATE: write fault on writable MAP_PRIVATE, read fault on
   MAP_SHARED
b) Writing to MAP_PRIVATE | MAP_SHARED from user space.
c) (below) MADV_POPULATE_WRITE: write fault on writable MAP_PRIVATE |
   MAP_SHARED

Especially, b) is kind of weird as implemented in QEMU
(util/oslib-posix.c:do_touch_pages):

"Read & write back the same value, so we don't corrupt existing user/app
data ... TODO: get a better solution from kernel so we don't need to
write at all so we don't cause wear on the storage backing the region..."

So if we have zero, we write zero. We'll COW pages, triggering a write
fault - and that's the only good thing about it. For example, similar to
MADV_POPULATE, nothing stops KSM from merging anonymous pages again. So
for anonymous memory the actual write is not helpful at all. Similarly
for hugetlbfs, the actual write is not necessary - but there is no other
way to really achieve the goal.

--- How MADV_POPULATE is useful ---

With virtio-mem, our VM will usually write to memory before it reads it.

With 1) and 2) it does exactly what I want: trigger COW / allocate
memory and trigger a write fault. The only issue with 1) is that KSM
might come around and undo our work - but that could only be avoided by
writing random numbers to all pages from user space. Or we could simply
disable KSM in that setup ...

--- How MADV_POPULATE is not perfect ---

KSM can merge anonymous pages again. Just like the current QEMU
implementation. The only way around that is writing random numbers to
the pages or mlocking all memory. No big news.

Nothing stops reclaim/swap code from depopulating when using files.
Again, no big news - we have to mlock.

--- How MADV_POPULATE_WRITE might be useful ---

With 3) 4) 5) MADV_POPULATE does partially what I want: preallocate
memory and populate page tables.
But as it's a read fault, I think we'll
have another minor fault on the first write access. Not perfect, but
better than failing with SIGBUS. One way around that would be having an
additional MADV_POPULATE_WRITE, to use in cases where it makes sense (I
think at least 3) and 4), most probably not on actual files like 5) ).
Trigger a write fault without actually writing.

Makes sense?

-- 
Thanks,

David / dhildenb