Date: Mon, 31 Jul 2023 13:05:34 -0400
From: Peter Xu <peterx@redhat.com>
To: "Kasireddy, Vivek"
Cc: Hugh Dickins, Jason Gunthorpe, Alistair Popple, Gerd Hoffmann,
    "Kim, Dongwon", David Hildenbrand, "Chang, Junxiao",
    linux-mm@kvack.org, dri-devel@lists.freedesktop.org, Mike Kravetz
Subject: Re: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)
On Sat, Jul 29, 2023 at 12:08:25AM +0000, Kasireddy, Vivek wrote:
> Hi Peter,
> 
> > > > > > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > > > > > brittle and effectively makes this notifier udmabuf specific right?
> > > > > > > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > > > > > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > > > > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > > > > > collects a bunch of pages (associated with buffers) scattered inside
> > > > > > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > > > > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > > > > > use DMA fences to ensure that the components on the Host and
> > > > > > > Guest do not access the buffer simultaneously.
> > > > > > 
> > > > > > So why do you need to track updates proactively like this?
> > > > > As David noted in the earlier series, if Qemu punches a hole in its memfd
> > > > > that goes through pages that are registered against a udmabuf fd, then
> > > > > udmabuf needs to update its list with new pages when the hole gets
> > > > > filled after (guest) writes. Otherwise, we'd run into the coherency
> > > > > problem (between udmabuf and memfd) as demonstrated in the selftest
> > > > > (patch #3 in this series).
> > > > 
> > > > Wouldn't this all be very much better if Qemu stopped punching holes there?
> > > I think holes can be punched anywhere in the memfd for various reasons. Some
> > 
> > I just start to read this thread, even haven't finished all of them.. but
> > so far I'm not sure whether this is right at all..
> > 
> > udmabuf is a file, it means it should follow the file semantics.
> Right, it is a file but a special type of file given that it is a dmabuf. So,
> AFAIK, operations such as truncate, FALLOC_FL_PUNCH_HOLE, etc cannot be done
> on it. And, in our use-case, since udmabuf driver is sharing (or exporting) its
> buffer (via the fd), consumers (or importers) of the dmabuf fd are expected
> to only read from it.
> 
> > Mmu notifier is per-mm, otoh.
> > 
> > Imagine for some reason QEMU mapped the guest pages twice, udmabuf is
> > created with vma1, so udmabuf registers the mm changes over vma1 only.
> Udmabufs are created with pages obtained from the mapping using offsets
> provided by Qemu.
> 
> > However the shmem/hugetlb page cache can be populated in either vma1, or
> > vma2. It means when populating on vma2 udmabuf won't get update notify at
> > all, udmabuf pages can still be obsolete. Same thing to when multi-process
> In this (unlikely) scenario you described above,

IMHO it's perfectly legal for QEMU to do that, and we won't want this to
break so easily and silently simply because QEMU mapped it twice.  I would
hope it won't be me who has to debug something like that. :)

I actually have a tree that does exactly that:

https://github.com/xzpeter/qemu/commit/62050626d6e511d022953165cc0f604bf90c5324

But that's definitely not in mainline.. it shouldn't need special attention,
either.  I just want to say that it can always happen for various reasons,
especially in a relatively involved piece of software like QEMU.

> I think we could still find all the
> VMAs (and ranges) where the guest buffer pages are mapped (and register
> for PTE updates) using Qemu's mm_struct. The below code can be modified
> to create a list of VMAs where the guest buffer pages are mapped.
> 
> static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
>                                                  struct mm_struct *vmm_mm)
> {
>         struct vm_area_struct *vma = NULL;
>         MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
>         unsigned long addr;
>         pgoff_t pg;
> 
>         mas_set(&mas, 0);
>         mmap_read_lock(vmm_mm);
>         mas_for_each(&mas, vma, ULONG_MAX) {
>                 for (pg = 0; pg < ubuf->pagecount; pg++) {
>                         addr = page_address_in_vma(ubuf->pages[pg], vma);
>                         if (addr == -EFAULT)
>                                 break;
>                 }
>                 if (addr != -EFAULT)
>                         break;
>         }
>         mmap_read_unlock(vmm_mm);
> 
>         return vma;
> }

This looks hackish to me, and it won't work across mms (multi-process QEMU).

> > QEMU is used, where we can have vma1 in QEMU while vma2 in the other
> > process like vhost-user.
> > 
> > I think the trick here is we tried to "hide" the fact that these are
> > actually normal file pages, but we're doing PFNMAP on them... then we want
> > the file features back, like hole punching..
> > 
> > If we used normal file operations, everything will just work fine; TRUNCATE
> > will unmap the host mapped frame buffers when needed, and when accessed
> > it'll fault on demand from the page cache.  We seem to be trying to
> > reinvent "truncation" for pfnmap but mmu notifier doesn't sound right to
> > this at least..
> If we can figure out the VMA ranges where the guest buffer pages are mapped,
> we should be able to register mmu notifiers for those ranges right?

In general, sorry to say this, but mmu notifiers still do not sound like the
right approach here.
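To make the "per-mm" point concrete, here is a rough and completely untested
sketch of what an interval-notifier registration looks like (the udmabuf_sub
and udmabuf_watch_range names are made up purely for illustration); note how
the subscription is bound to a single mm_struct and a single VA range:

#include <linux/mmu_notifier.h>

/* Hypothetical per-buffer state, for illustration only */
struct udmabuf_sub {
        struct mmu_interval_notifier mn;
};

static bool udmabuf_invalidate(struct mmu_interval_notifier *mni,
                               const struct mmu_notifier_range *range,
                               unsigned long cur_seq)
{
        /* Only invoked for PTE changes within this mm and this VA range */
        mmu_interval_set_seq(mni, cur_seq);
        return true;
}

static const struct mmu_interval_notifier_ops udmabuf_mn_ops = {
        .invalidate = udmabuf_invalidate,
};

static int udmabuf_watch_range(struct udmabuf_sub *sub, struct mm_struct *mm,
                               unsigned long start, unsigned long length)
{
        /*
         * The subscription is tied to exactly one mm_struct and one VA
         * range.  A second mapping of the same memfd pages - another VMA
         * in the same process, or another process such as vhost-user -
         * is simply not covered, so updates through it go unnoticed.
         */
        return mmu_interval_notifier_insert(&sub->mn, mm, start, length,
                                            &udmabuf_mn_ops);
}

However many such ranges we collect up front, anything mapped later, or
mapped by an mm we never walked, stays invisible to the notifier.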
> > > of the use-cases where this would be done were identified by David. Here is
> > > what he said in an earlier discussion:
> > > "There are *probably* more issues on the QEMU side when udmabuf is paired
> > > with things like MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE used for
> > > virtio-balloon, virtio-mem, postcopy live migration, ... for example, in"
> > 
> > Now after seeing this, I'm truly wondering whether we can still simply
> > use the file semantics we already have (for either shmem/hugetlb/...), or
> > is it a must we need to use a single fd to represent all?
> > 
> > Say, can we just use a tuple (fd, page_array) rather than the udmabuf
> > itself to do host zero-copy mapping?  the page_array can be e.g. a list of
> That (tuple) is essentially what we are doing (with udmabuf) but in a
> standardized way that follows convention using the dmabuf buffer sharing
> framework that all the importers (other drivers and userspace components)
> know and understand.
> 
> > file offsets that points to the pages (rather than pinning the pages using
> If we are using the dmabuf framework, the pages must be pinned when the
> importers map them.

Oh, so the pages are for DMA from hardware devices, rather than being
accessed by host programs?  I have essentially zero knowledge of that
aspect, sorry.  If so, I don't know how truncation can work with that while
keeping the pages coherent.

Hugh asked why QEMU can't simply stop doing that truncation; I'll then ask
the same.  Probably virtio-mem will not be able to work.  I think postcopy
will not be affected - postcopy only drops pages at a very early stage of
the dest QEMU, not after the VM has started there, so it's either not
affected or there's a chance it'll work.  IIUC it's then the same as when
VFIO is attached and we try to blow some pages away with something like
virtio-balloon - AFAIR QEMU just explicitly doesn't allow that to happen.
See vfio_ram_block_discard_disable().

> > FOLL_GET). The good thing is then the fd can be the guest memory file
> > itself. With that, we can mmap() over the shmem/hugetlb in whatever vma
> > and whatever process. Truncation (and actually everything... e.g. page
> > migration, swapping, ... which will be disabled if we use PFNMAP pins) will
> > just all start to work, afaiu.
> IIUC, we'd not be able to use the fd of the guest memory file because the
> dmabuf fds are expected to have constant size that reflects the size of the
> buffer that is being shared. I just don't think it'd be feasible given all the
> other restrictions:
> https://www.kernel.org/doc/html/latest/driver-api/dma-buf.html?highlight=dma_buf#userspace-interface-notes

Yeah, I also don't know the dmabuf APIs well, but I think if the pages must
be pinned for real-world DMA then it's already another story to me.. what I
said about the [guest_mem_fd, offset_array] tuple idea could only (if still
possible..) work if the udmabuf access is only from the processor side,
never from the device.
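Just to illustrate what I mean by "only from the processor side": resolve
the page from the backing file at access time instead of holding a long-term
pin.  A rough, untested sketch (shmem only; the helper name is made up):

#include <linux/shmem_fs.h>
#include <linux/pagemap.h>

/*
 * Illustration only: look the page up by (file, offset) when it is
 * actually needed.  After a hole punch, the next lookup returns whatever
 * page currently backs that offset (faulting it in if necessary), so
 * there is no separate page list that can go stale.
 */
static struct page *guest_page_lookup(struct file *memfd, pgoff_t pgoff)
{
        /* shmem-backed memfd; a hugetlbfs memfd would need its own path */
        return shmem_read_mapping_page(memfd->f_mapping, pgoff);
}

But, as said, that only helps while the buffer is touched by the CPU; as
soon as a device needs a long-term pin we are back to square one.

Thanks,

-- 
Peter Xu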