Date: Wed, 2 Aug 2023 14:40:34 +0200
From: Daniel Vetter
To: Vivek Kasireddy
Cc: dri-devel@lists.freedesktop.org, linux-mm@kvack.org, Dongwon Kim,
 David Hildenbrand, Junxiao Chang, Hugh Dickins, Peter Xu, Gerd Hoffmann,
 Jason Gunthorpe, Mike Kravetz
Subject: Re: [RFC v1 2/3] udmabuf: Replace pages when there is FALLOC_FL_PUNCH_HOLE in memfd
References: <20230718082858.1570809-1-vivek.kasireddy@intel.com>
 <20230718082858.1570809-3-vivek.kasireddy@intel.com>
In-Reply-To: <20230718082858.1570809-3-vivek.kasireddy@intel.com>

On Tue, Jul 18, 2023 at 01:28:57AM -0700, Vivek Kasireddy wrote:
> When a hole is punched in the memfd or when a page is replaced for
> any reason, the udmabuf driver needs to get notified in order to
> update its list of pages with the new page. To accomplish this, we
> first identify the vma ranges where pages associated with a given
> udmabuf are mapped to and then register a handler for update_mapping
> mmu notifier for receiving mapping updates.
>
> Once we get notified about a new page faulted in at a given offset
> in the mapping (backed by shmem or hugetlbfs), the list of pages
> is updated and we also zap the relevant PTEs associated with the
> vmas that have mmap'd the udmabuf fd.
>
> Cc: David Hildenbrand
> Cc: Mike Kravetz
> Cc: Hugh Dickins
> Cc: Peter Xu
> Cc: Jason Gunthorpe
> Cc: Gerd Hoffmann
> Cc: Dongwon Kim
> Cc: Junxiao Chang
> Signed-off-by: Vivek Kasireddy

I think the long thread made it clear already, so just for the record:

This won't work. udmabuf is very intentionally built around
pin_user_pages() semantics: if you change the underlying mapping, you
get to keep all the pieces.

The _only_ way to make this work is by implementing the dma_buf move
notification infrastructure, and most importers can't cope with such a
dynamic dma-buf, so it most likely will not solve your use-case.
Everything else races in a fundamental and unfixable way.
-Daniel
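
An illustrative aside (not part of Daniel's mail): the move-notification
contract referred to above, as exposed by <linux/dma-buf.h>.
dma_buf_dynamic_attach(), struct dma_buf_attach_ops and
dma_buf_move_notify() are existing dma-buf API; the my_* helpers and the
callback body are a hypothetical sketch, not code from this series.

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>

/* Importer side: invalidate cached device mappings when the exporter
 * moves the backing storage; new mappings are created later while
 * holding the dma_resv lock. */
static void my_move_notify(struct dma_buf_attachment *attach)
{
        /* tear down mappings derived from this attachment */
}

static const struct dma_buf_attach_ops my_attach_ops = {
        .allow_peer2peer = false,
        .move_notify = my_move_notify,  /* mandatory for dynamic importers */
};

/* Importer side: attach dynamically instead of pinning pages forever. */
static struct dma_buf_attachment *my_attach(struct dma_buf *dmabuf,
                                            struct device *dev, void *priv)
{
        return dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, priv);
}

/* Exporter side: tell every attachment that the backing pages are about
 * to change; must be called with the reservation lock held. */
static void my_pages_changed(struct dma_buf *dmabuf)
{
        dma_resv_lock(dmabuf->resv, NULL);
        dma_buf_move_notify(dmabuf);
        dma_resv_unlock(dmabuf->resv);
}
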
> ---
>  drivers/dma-buf/udmabuf.c | 172 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 172 insertions(+)
>
> diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
> index 10c47bf77fb5..189a36c41906 100644
> --- a/drivers/dma-buf/udmabuf.c
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -4,6 +4,8 @@
>  #include
>  #include
>  #include
> +#include
> +#include
>  #include
>  #include
>  #include
> @@ -30,6 +32,23 @@ struct udmabuf {
>          struct sg_table *sg;
>          struct miscdevice *device;
>          pgoff_t *offsets;
> +        struct udmabuf_vma_range *ranges;
> +        unsigned int num_ranges;
> +        struct mmu_notifier notifier;
> +        struct mutex mn_lock;
> +        struct list_head mmap_vmas;
> +};
> +
> +struct udmabuf_vma_range {
> +        struct file *memfd;
> +        pgoff_t ubufindex;
> +        unsigned long start;
> +        unsigned long end;
> +};
> +
> +struct udmabuf_mmap_vma {
> +        struct list_head vma_link;
> +        struct vm_area_struct *vma;
>  };
>
>  static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> @@ -42,28 +61,54 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
>          if (pgoff >= ubuf->pagecount)
>                  return VM_FAULT_SIGBUS;
>
> +        mutex_lock(&ubuf->mn_lock);
>          pfn = page_to_pfn(ubuf->pages[pgoff]);
>          if (ubuf->offsets) {
>                  pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
>          }
> +        mutex_unlock(&ubuf->mn_lock);
>
>          return vmf_insert_pfn(vma, vmf->address, pfn);
>  }
>
> +static void udmabuf_vm_close(struct vm_area_struct *vma)
> +{
> +        struct udmabuf *ubuf = vma->vm_private_data;
> +        struct udmabuf_mmap_vma *mmap_vma;
> +
> +        list_for_each_entry(mmap_vma, &ubuf->mmap_vmas, vma_link) {
> +                if (mmap_vma->vma == vma) {
> +                        list_del(&mmap_vma->vma_link);
> +                        kfree(mmap_vma);
> +                        break;
> +                }
> +        }
> +}
> +
>  static const struct vm_operations_struct udmabuf_vm_ops = {
>          .fault = udmabuf_vm_fault,
> +        .close = udmabuf_vm_close,
>  };
>
>  static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
>  {
>          struct udmabuf *ubuf = buf->priv;
> +        struct udmabuf_mmap_vma *mmap_vma;
>
>          if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
>                  return -EINVAL;
>
> +        mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
> +        if (!mmap_vma)
> +                return -ENOMEM;
> +
>          vma->vm_ops = &udmabuf_vm_ops;
>          vma->vm_private_data = ubuf;
>          vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
> +
> +        mmap_vma->vma = vma;
> +        list_add(&mmap_vma->vma_link, &ubuf->mmap_vmas);
> +
>          return 0;
>  }
>
> @@ -109,6 +154,7 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
>          if (ret < 0)
>                  goto err_alloc;
>
> +        mutex_lock(&ubuf->mn_lock);
>          for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) {
>                  offset = ubuf->offsets ? ubuf->offsets[i] : 0;
>                  sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE, offset);
> @@ -116,9 +162,12 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
>          ret = dma_map_sgtable(dev, sg, direction, 0);
>          if (ret < 0)
>                  goto err_map;
> +
> +        mutex_unlock(&ubuf->mn_lock);
>          return sg;
>
>  err_map:
> +        mutex_unlock(&ubuf->mn_lock);
>          sg_free_table(sg);
>  err_alloc:
>          kfree(sg);
> @@ -157,6 +206,9 @@ static void release_udmabuf(struct dma_buf *buf)
>
>          for (pg = 0; pg < ubuf->pagecount; pg++)
>                  put_page(ubuf->pages[pg]);
> +
> +        mmu_notifier_unregister(&ubuf->notifier, ubuf->notifier.mm);
> +        kfree(ubuf->ranges);
>          kfree(ubuf->offsets);
>          kfree(ubuf->pages);
>          kfree(ubuf);
> @@ -208,6 +260,93 @@ static const struct dma_buf_ops udmabuf_ops = {
>          .end_cpu_access = end_cpu_udmabuf,
>  };
>
> +static void invalidate_mmap_vmas(struct udmabuf *ubuf,
> +                                 struct udmabuf_vma_range *range,
> +                                 unsigned long address, unsigned long size)
> +{
> +        struct udmabuf_mmap_vma *vma;
> +        unsigned long start = range->ubufindex << PAGE_SHIFT;
> +
> +        start += address - range->start;
> +        list_for_each_entry(vma, &ubuf->mmap_vmas, vma_link) {
> +                zap_vma_ptes(vma->vma, vma->vma->vm_start + start, size);
> +        }
> +}
> +
> +static struct udmabuf_vma_range *find_udmabuf_range(struct udmabuf *ubuf,
> +                                                    unsigned long address)
> +{
> +        struct udmabuf_vma_range *range;
> +        int i;
> +
> +        for (i = 0; i < ubuf->num_ranges; i++) {
> +                range = &ubuf->ranges[i];
> +                if (address >= range->start && address < range->end)
> +                        return range;
> +        }
> +
> +        return NULL;
> +}
> +
> +static void update_udmabuf(struct mmu_notifier *mn, struct mm_struct *mm,
> +                           unsigned long address, unsigned long pfn)
> +{
> +        struct udmabuf *ubuf = container_of(mn, struct udmabuf, notifier);
> +        struct udmabuf_vma_range *range = find_udmabuf_range(ubuf, address);
> +        struct page *old_page, *new_page;
> +        pgoff_t pgoff, pgshift = PAGE_SHIFT;
> +        unsigned long size = 0;
> +
> +        if (!range || !pfn_valid(pfn))
> +                return;
> +
> +        if (is_file_hugepages(range->memfd))
> +                pgshift = huge_page_shift(hstate_file(range->memfd));
> +
> +        mutex_lock(&ubuf->mn_lock);
> +        pgoff = range->ubufindex + ((address - range->start) >> pgshift);
> +        old_page = ubuf->pages[pgoff];
> +        new_page = pfn_to_page(pfn);
> +
> +        do {
> +                ubuf->pages[pgoff] = new_page;
> +                get_page(new_page);
> +                put_page(old_page);
> +                size += PAGE_SIZE;
> +        } while (ubuf->pages[++pgoff] == old_page);
> +
> +        mutex_unlock(&ubuf->mn_lock);
> +        invalidate_mmap_vmas(ubuf, range, address, size);
> +}
> +
> +static const struct mmu_notifier_ops udmabuf_update_ops = {
> +        .update_mapping = update_udmabuf,
> +};
> +
> +static struct vm_area_struct *find_guest_ram_vma(struct udmabuf *ubuf,
> +                                                 struct mm_struct *vmm_mm)
> +{
> +        struct vm_area_struct *vma = NULL;
> +        MA_STATE(mas, &vmm_mm->mm_mt, 0, 0);
> +        unsigned long addr;
> +        pgoff_t pg;
> +
> +        mas_set(&mas, 0);
> +        mmap_read_lock(vmm_mm);
> +        mas_for_each(&mas, vma, ULONG_MAX) {
> +                for (pg = 0; pg < ubuf->pagecount; pg++) {
> +                        addr = page_address_in_vma(ubuf->pages[pg], vma);
> +                        if (addr == -EFAULT)
> +                                break;
> +                }
> +                if (addr != -EFAULT)
> +                        break;
> +        }
> +        mmap_read_unlock(vmm_mm);
> +
> +        return vma;
> +}
> +
>  #define SEALS_WANTED (F_SEAL_SHRINK)
>  #define SEALS_DENIED (F_SEAL_WRITE)
>
> @@ -218,6 +357,7 @@ static long udmabuf_create(struct miscdevice *device,
>          DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
>          struct file *memfd = NULL;
>          struct address_space *mapping = NULL;
> +        struct vm_area_struct *guest_ram;
>          struct udmabuf *ubuf;
>          struct dma_buf *buf;
>          pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
> @@ -252,6 +392,13 @@ static long udmabuf_create(struct miscdevice *device,
>                  goto err;
>          }
>
> +        ubuf->ranges = kmalloc_array(head->count, sizeof(*ubuf->ranges),
> +                                     GFP_KERNEL);
> +        if (!ubuf->ranges) {
> +                ret = -ENOMEM;
> +                goto err;
> +        }
> +
>          pgbuf = 0;
>          for (i = 0; i < head->count; i++) {
>                  ret = -EBADFD;
> @@ -270,6 +417,8 @@ static long udmabuf_create(struct miscdevice *device,
>                          goto err;
>                  pgoff = list[i].offset >> PAGE_SHIFT;
>                  pgcnt = list[i].size >> PAGE_SHIFT;
> +                ubuf->ranges[i].ubufindex = pgbuf;
> +                ubuf->ranges[i].memfd = memfd;
>                  if (is_file_hugepages(memfd)) {
>                          if (!ubuf->offsets) {
>                                  ubuf->offsets = kmalloc_array(ubuf->pagecount,
> @@ -299,6 +448,7 @@ static long udmabuf_create(struct miscdevice *device,
>                          get_page(hpage);
>                          ubuf->pages[pgbuf] = hpage;
>                          ubuf->offsets[pgbuf++] = chunkoff << PAGE_SHIFT;
> +
>                          if (++chunkoff == maxchunks) {
>                                  put_page(hpage);
>                                  hpage = NULL;
> @@ -334,6 +484,25 @@ static long udmabuf_create(struct miscdevice *device,
>                  goto err;
>          }
>
> +        guest_ram = find_guest_ram_vma(ubuf, current->mm);
> +        if (!guest_ram)
> +                goto err;
> +
> +        ubuf->notifier.ops = &udmabuf_update_ops;
> +        ret = mmu_notifier_register(&ubuf->notifier, current->mm);
> +        if (ret)
> +                goto err;
> +
> +        ubuf->num_ranges = head->count;
> +        for (i = 0; i < ubuf->num_ranges; i++) {
> +                page = ubuf->pages[ubuf->ranges[i].ubufindex];
> +                ubuf->ranges[i].start = page_address_in_vma(page, guest_ram);
> +                ubuf->ranges[i].end = ubuf->ranges[i].start + list[i].size;
> +        }
> +
> +        INIT_LIST_HEAD(&ubuf->mmap_vmas);
> +        mutex_init(&ubuf->mn_lock);
> +
>          flags = 0;
>          if (head->flags & UDMABUF_FLAGS_CLOEXEC)
>                  flags |= O_CLOEXEC;
> @@ -344,6 +513,9 @@ static long udmabuf_create(struct miscdevice *device,
>                  put_page(ubuf->pages[--pgbuf]);
>          if (memfd)
>                  fput(memfd);
> +        if (ubuf->notifier.mm)
> +                mmu_notifier_unregister(&ubuf->notifier, ubuf->notifier.mm);
> +        kfree(ubuf->ranges);
>          kfree(ubuf->offsets);
>          kfree(ubuf->pages);
>          kfree(ubuf);
> --
> 2.39.2
>

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
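
For context, an editorial sketch (not part of the thread) of the
userspace sequence the series is trying to cope with: a memfd is handed
to udmabuf, which pins the pages currently backing it, and the memfd
owner later punches a hole, so those pinned pages no longer back the
file. memfd_create(2), the UDMABUF_CREATE ioctl and fallocate(2) with
FALLOC_FL_PUNCH_HOLE are the existing uAPI; the buffer size, file name
and minimal error handling are illustrative assumptions.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/udmabuf.h>

int main(void)
{
        long sz = 4 << 20;                      /* 4 MiB of "guest RAM" */
        int memfd, devfd, buffd;
        struct udmabuf_create create;

        memfd = memfd_create("guest-ram", MFD_ALLOW_SEALING);
        if (memfd < 0 || ftruncate(memfd, sz))
                return 1;
        fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);  /* udmabuf requires this seal */

        create.memfd  = memfd;
        create.flags  = UDMABUF_FLAGS_CLOEXEC;
        create.offset = 0;
        create.size   = sz;

        devfd = open("/dev/udmabuf", O_RDWR);
        buffd = ioctl(devfd, UDMABUF_CREATE, &create);  /* pins the current pages */

        /*
         * The problematic step: the memfd owner punches a hole, so the
         * pages udmabuf pinned are no longer the pages backing the file.
         */
        fallocate(memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, sz / 2);

        printf("udmabuf fd: %d\n", buffd);
        return buffd < 0;
}
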