To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Arnd Bergmann, Oscar Salvador, Matthew Wilcox, Andrea Arcangeli, Minchan Kim, Jann Horn, Jason Gunthorpe, Dave Hansen, Hugh Dickins, Rik van Riel, "Michael S. Tsirkin", "Kirill A. Shutemov", Vlastimil Babka, Richard Henderson, Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller, Chris Zankel, Max Filippov, Mike Kravetz, Peter Xu, Rolf Eike Beer, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org, linux-arch@vger.kernel.org, Linux API
References: <20210511081534.3507-1-david@redhat.com> <20210511081534.3507-3-david@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH resend v2 2/5] mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables
Date: Tue, 18 May 2021 14:03:52 +0200
>>> This means that you want to have two different uses depending on the
>>> underlying mapping type. MADV_POPULATE_READ seems rather weak for
>>> anonymous/private mappings. Memory backed by zero pages seems rather
>>> unhelpful as the PF would need to do all the heavy lifting anyway.
>>> Or is there any actual usecase when this is desirable?
>>
>> Currently, userfaultfd-wp, which requires "some mapping" to be able to arm
>> successfully. In QEMU, we currently have to prefault the shared zeropage for
>> userfaultfd-wp to work as expected.
>
> Just for clarification. The aim is to reduce the memory footprint at the
> same time, right? If that is really the case then this is worth adding.

Yes. userfaultfd-wp is right now used in QEMU for background snapshotting
of VMs. Just because you trigger a background snapshot doesn't mean that
you want to COW all pages (especially if your VM previously inflated the
balloon, was using free page reporting, etc.).

>
>> I expect that use case might vanish over
>> time (eventually with new kernels and updated user space), but it might
>> stick for a bit.
>
> Could you elaborate some more please?

After I raised that the current behavior of userfaultfd-wp is suboptimal,
Peter started working on a userfaultfd-wp mode that doesn't require
prefaulting all pages just to work reliably -- getting notified when any
page changes, including ones that haven't been populated yet and would
have been populated with the shared zeropage on first access. Not sure
what the state of that is and when we might see it.
>
>> Apart from that, populating the shared zeropage might be relevant in some
>> corner cases: I remember there are sparse matrix algorithms that operate
>> heavily on the shared zeropage.
>
> I am not sure I see why this would be a useful interface for those? Zero
> page read fault is really low cost. Or are you worried about cumulative
> overhead by entering the kernel many times?

Yes, cumulative overhead when dealing with large, sparse matrices. Just
an example where I think it could be applied in the future -- but not
that I consider populating the shared zeropage a really important use
case in general (besides for userfaultfd-wp right now).

>
>>> So the split into these two modes seems more like gup interface
>>> shortcomings bubbling up to the interface. I do expect userspace only
>>> cares about pre-faulting the address range, no matter what the backing
>>> storage is.
>>>
>>> Or do I still misunderstand all the usecases?
>>
>> Let me give you an example where we really cannot tell what would be best
>> from a kernel perspective.
>>
>> a) Mapping a file into a VM to be used as RAM. We might expect the guest
>> to write all memory immediately (e.g., booting Windows). We would want
>> MADV_POPULATE_WRITE, as we expect a write access immediately.
>>
>> b) Mapping a file into a VM to be used as a fake NVDIMM, for example as a
>> rootfs or just data storage. We expect mostly reading from this memory;
>> thus, we would want MADV_POPULATE_READ.
>
> I am afraid I do not follow. Could you be more explicit about the advantages
> of using those two modes for those example usecases? Is that to share
> resources (e.g. by not breaking CoW)?
I'm only talking about shared mappings of "ordinary files" for now,
because that's where MADV_POPULATE_READ vs. MADV_POPULATE_WRITE differ
in regard to "mark something dirty and write it back"; CoW doesn't apply
to shared mappings, it's really just a difference in dirtying and having
to write back. For things like PMEM/hugetlbfs/... we usually want
MADV_POPULATE_WRITE, because then we avoid a context switch when our VM
actually writes to a page the first time -- and we don't care about
dirtying, because we don't have writeback.

But again, that's just one use case I have in mind coming from the VM
area. I consider MADV_POPULATE_READ really only useful when we are
expecting mostly read access on a mapping. (I assume there are other use
cases, for databases etc., not explored yet, where MADV_POPULATE_WRITE
would not be desired for performance reasons.)

>
>> Instead of trying to be smart in the kernel, I think for this case it makes
>> much more sense to provide user space the options. IMHO it doesn't really
>> hurt to let user space decide on what it thinks is best.
>
> I am mostly worried that this will turn out to be more confusing than
> helpful. People will need to grasp non-trivial concepts and kernel-internal
> implementation details about how read/write faults are handled.

And that's the point: in the simplest case (without any additional
considerations about the underlying mapping), if you end up mostly
*reading*, MADV_POPULATE_READ is the right thing; if you end up mostly
*writing*, MADV_POPULATE_WRITE is the right thing. Care only has to be
taken when you really want a "preallocation" as in "allocate backend
storage" or "don't ever use the shared zeropage". I agree that these
details require more knowledge, but so does anything that messes with
memory mappings on that level (VMs, databases, ...).

QEMU currently implements exactly these two cases manually in user space.
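The decision rule I'm describing above can be written down in a few
lines. This is purely illustrative (the helper and its parameters are
made up; only the flag values come from the patch): pick write-populate
when the mapping will mostly be written *and* has no writeback to worry
about, otherwise read-populate to avoid dirtying everything up front.

```c
#include <sys/mman.h>

/* Values proposed by this patch; older uapi headers lack them. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ  22
#define MADV_POPULATE_WRITE 23
#endif

/* Hypothetical helper: choose the populate mode from the expected
 * access pattern. mostly_writes: the application expects to write
 * most of the range soon (e.g., file used as guest RAM).
 * has_writeback: dirtying pages up front would queue writeback
 * (ordinary file on a filesystem, as opposed to PMEM/hugetlbfs). */
static int populate_mode(int mostly_writes, int has_writeback)
{
	/* Mostly-read mappings, or mappings where eager dirtying would
	 * cost us writeback, are better served by read-populating. */
	if (!mostly_writes || has_writeback)
		return MADV_POPULATE_READ;
	/* Write-mostly, no writeback (PMEM/hugetlbfs guest RAM):
	 * write-populate to avoid a second fault on first write. */
	return MADV_POPULATE_WRITE;
}
```

The point of the patch is exactly that the kernel cannot compute these
two inputs on its own -- only the application knows its access pattern.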
Anyhow, please suggest a way to handle it via a single flag in the
kernel -- which would be some kind of heuristic, as we know from
MAP_POPULATE. Having an alternative at hand would make it easier to
discuss this topic further. I certainly *don't* want MAP_POPULATE
semantics when it comes to MADV_POPULATE, especially when it comes to
shared mappings: not useful in QEMU now or in the future.

We could make MADV_POPULATE act depending on the readability/writability
of a mapping: use MADV_POPULATE_WRITE on writable mappings, and
MADV_POPULATE_READ on read-only mappings. Certainly not perfect for use
cases where you have writable mappings that are mostly read-only (as in
the fake-NVDIMM example I gave ...), but if it makes people happy, fine
with me. I mostly care about MADV_POPULATE_WRITE.

-- 
Thanks,

David / dhildenb