Date: Mon, 26 Jun 2023 13:52:34 -0400
From: Peter Xu <peterx@redhat.com>
To: "Kasireddy, Vivek"
Cc: David Hildenbrand, dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
    Mike Kravetz, Gerd Hoffmann, "Kim, Dongwon", Andrew Morton,
    James Houghton, Jerome Marchand, "Chang, Junxiao", "Kirill A. Shutemov",
Shutemov" , "Hocko, Michal" , Muchun Song , Jason Gunthorpe , John Hubbard Subject: Re: [PATCH v1 0/2] udmabuf: Add back support for mapping hugetlb pages Message-ID: References: <20230622072710.3707315-1-vivek.kasireddy@intel.com> <6e429fbc-e0e6-53c0-c545-2e2cbbe757de@redhat.com> MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline X-Rspamd-Queue-Id: BDD8716001A X-Rspam-User: X-Rspamd-Server: rspam11 X-Stat-Signature: crs3m3c1w65481ca5u998xy3nyxxxhnq X-HE-Tag: 1687801958-111034 X-HE-Meta: U2FsdGVkX1+rrQclqugsJSK7YZyc84UT8LmtD2o780N0JwBB0U7XCBIF/lsIGcLpD8s0sDR9cJ0inf+b874Xi6qxCwv5BEklYQ90SrwvroA/qlSpmgS4hg/UAsfvCGBdxW1Rg93QoCFXd8WUiiOjsQdIJS2fp8CceEhVeJeXC0qAO9IRy24GX4Nl0Qbw0hNh7W3qskZT/cU1vuVckzsji15wckNd6zgfs7lT+oFtpGEtlazgr32mvRN0ZjuvZ7X5vkS54JvgkrhMT6r+qbgY8F3cpnx753ZjtKF/w85tqGN11P/4TmbPsVeb0Eh3h1X1WyQsp/EPVXpiJm/u/LoUWoNFAXjfBpKwEZRyjivfb5EElYcCZsJhjGI9RCcnP8R+w86QgmbzHyy7k6urhtjswu5SP7qgNPhhcXhna+h+mqVOKA5RE5xecxaFMW9a3AiyqOS6f6KmDlVxQ0/4xZQt/vehfzJeZBiSiL5+oKIdn2lgWVEW3kTmxGr+AAGdYy0mlsHStqcNn/NPz9aCCOX+fFL7FAFrzV5CawGlrXntG68vhnTHpT95JU9SlwVJxDWQVIzizvb1Q4bLoQgmccaW94ppv42cH/D3kt4hwJdWKiyLvbW+Ssi9MjKSOAoPbaPdMg81ZL6ov4HH/q1f5uZUPZ9LHoByXd4i/q43Ch0agDW/53aO7+uXMIPtrbHIKUq19GTnDJhdUf1rCXQhgIV1kKqtmMy+T7LDzcaQfBTs9AC1FH5zvA4X78SP2rtNxfIXrpYMtXs+I3rUuyeca+lutCIL7OnIpIY49or504uyQBBrvwQwgKEE7YteEEQVKAuCgJRybmT89y2qp2QKleeIwemicF92DLuivhAT/dfseZRIrzE7ymYMqXRVz+bJ7vRjprMa68kZvE1jrLfa4Gjyy+doNIxKpNVtWht0ja/EfMU74S4uJOZAZq6uBgikzLVFXiTgPhealm7V7nwrVqh xgp0z1gK 4yDxhTfZjSQrhRUybsVLJGgySwlqxjmHaF3Xgi0tdlpWPlyG3TKvrpByLeTASXFnmoUJp0qlbtV6Y4FPPozart65uR1hgqI+zjghX2PhD4lE7dLhKA3FKrso1/lfSeMPwQBbkxBuV1hwhRWnMVP28SZWzGwggBWxfAwvH32HvI12UR+52Ako5TNd6d/E1MVz5W3+m56o5fs1cImOcP0ESMbbBL2EnAX+iugIDUjGSXSJkBhgejZGefNO/Q0o6PrHtKy0Ts5T8A7+mHFMr72z3BdESoOQi3sgfBhJEjAs5uj2fkoRTUsOxwSO4DQlgG3Eafk+cFfcXDuTO5M2C0Zg/garg4sMXJ71hUT9gD9CQuPJ1Af8DscBSNfFxybK5c5FU4WY5p9kBQPkUa+y/UceZsw2GG+LdEZ/KSSwjNVG/qC1vmuEKiCHgJVY3lPfdhPi6qoi/cVSwjKvwvCYT1jAAM2Y4ng== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Jun 26, 2023 at 07:45:37AM +0000, Kasireddy, Vivek wrote: > Hi Peter, > > > > > On Fri, Jun 23, 2023 at 06:13:02AM +0000, Kasireddy, Vivek wrote: > > > Hi David, > > > > > > > > The first patch ensures that the mappings needed for handling mmap > > > > > operation would be managed by using the pfn instead of struct page. > > > > > The second patch restores support for mapping hugetlb pages where > > > > > subpages of a hugepage are not directly used anymore (main reason > > > > > for revert) and instead the hugetlb pages and the relevant offsets > > > > > are used to populate the scatterlist for dma-buf export and for > > > > > mmap operation. > > > > > > > > > > Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 > > > > options > > > > > were passed to the Host kernel and Qemu was launched with these > > > > > relevant options: qemu-system-x86_64 -m 4096m.... > > > > > -device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080 > > > > > -display gtk,gl=on > > > > > -object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M > > > > > -machine memory-backend=mem1 > > > > > > > > > > Replacing -display gtk,gl=on with -display gtk,gl=off above would > > > > > exercise the mmap handler. 
> > > > 
> > > > While I think the VM_PFNMAP approach is much better and should fix
> > > > that issue at hand, I thought more about missing memlock support and
> > > > realized that we might have to fix something else. So I'm going to
> > > > raise the issue here.
> > > > 
> > > > I think udmabuf chose the wrong interface to do what it's doing,
> > > > which makes it harder to fix it eventually.
> > > > 
> > > > Instead of accepting a range in a memfd, it should just have accepted
> > > > a user space address range and then used
> > > > pin_user_pages(FOLL_WRITE|FOLL_LONGTERM) to longterm-pin the pages
> > > > "officially".
> > > Udmabuf indeed started off by using a user space address range and GUP,
> > > but the dma-buf subsystem maintainer had concerns with that approach in
> > > v2. It also had support for mlock in that version. Here is v2 and the
> > > relevant conversation:
> > > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2
> > > 
> > > > So what's the issue? Udmabuf effectively pins pages longterm
> > > > ("possibly forever") simply by grabbing a reference on them. These
> > > > pages might easily reside in ZONE_MOVABLE or in MIGRATE_CMA
> > > > pageblocks.
> > > > 
> > > > So what udmabuf does is break memory hotunplug and CMA, because it
> > > > turns pages that have to remain movable unmovable.
> > > > 
> > > > In the pin_user_pages(FOLL_LONGTERM) case we make sure to migrate
> > > > these pages. See mm/gup.c:check_and_migrate_movable_pages() and
> > > > especially folio_is_longterm_pinnable(). We'd probably have to
> > > > implement something similar for udmabuf, where we detect such
> > > > unpinnable pages and migrate them.
> > > The pages udmabuf pins are only those associated with Guest (GPU
> > > driver/virtio-gpu) resources (or buffers allocated and pinned from
> > > shmem via drm GEM). Some resources are short-lived, and some are
> > > long-lived, and whenever a resource gets destroyed, the pages are
> > > unpinned. And not all resources have their pages pinned. The resource
> > > that is pinned for the longest duration is the FB, and that's because
> > > it is updated every ~16ms (assuming 1920x1080@60) by the Guest GPU
> > > driver. We can certainly pin/unpin the FB after it is accessed on the
> > > Host as a workaround, but I guess that may not be very efficient given
> > > the amount of churn it would create.
> > > 
> > > Also, as far as migration or S3/S4 is concerned, my understanding is
> > > that all the Guest resources are destroyed and recreated again. So,
> > > wouldn't something similar happen during memory hotunplug?
> > > 
> > > > For example, pairing udmabuf with vfio (which pins pages using
> > > > pin_user_pages(FOLL_LONGTERM)) in QEMU will most probably not work in
> > > > all cases: if udmabuf longterm-pinned the pages "the wrong way", vfio
> > > > will fail to migrate them during FOLL_LONGTERM and consequently fail
> > > > pin_user_pages(). As long as udmabuf holds a reference on these
> > > > pages, that will never succeed.
> > > Dma-buf rules (for exporters) indicate that the pages only need to be
> > > pinned during the map_attachment phase (and until unmap_attachment
> > > happens). In other words, only when the sg_table is created by udmabuf.
> > > I guess one option would be to not hold any references during
> > > UDMABUF_CREATE and only grab references to the pages (as and when they
> > > get used) during this step. Would this help?
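
For reference, the longterm-pin flow being described above would look
roughly like this -- a hand-wavy, untested sketch (the helper name is
made up, and the _fast GUP variant is used for brevity):

static int udmabuf_pin_range_sketch(unsigned long addr, int nr_pages,
                                    struct page **pages)
{
        int pinned;

        /*
         * FOLL_LONGTERM makes GUP migrate pages out of ZONE_MOVABLE and
         * MIGRATE_CMA pageblocks before pinning them, so a longterm pin
         * can't break memory hotunplug or CMA.
         */
        pinned = pin_user_pages_fast(addr, nr_pages,
                                     FOLL_WRITE | FOLL_LONGTERM, pages);
        if (pinned < 0)
                return pinned;
        if (pinned != nr_pages) {
                /* Partial pin: drop what we got and fail the create. */
                unpin_user_pages(pages, pinned);
                return -EFAULT;
        }
        return 0;
}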
> > 
> > IIUC the refcount is needed, otherwise I don't see what would protect
> > the page from being freed and even reused elsewhere before
> > map_attachment().
> > 
> > It seems the previous concern with using gup was mostly fork(); if this
> > is it:
> > 
> > https://patchwork.freedesktop.org/patch/210992/?series=39879&rev=2#comment_414213
> > 
> > Could it also be guarded by just making sure the pages are MAP_SHARED
> > when creating the udmabuf, if fork() is a requirement of the feature?
> > 
> > I had a feeling that the userspace still needs to always do the right
> > thing to make it work, even when using pure PFN mappings.
> > 
> > For instance, what if the userapp just punches a hole in the
> > shmem/hugetlbfs file after creating the udmabuf (I see that
> > F_SEAL_SHRINK is required, but at least not F_SEAL_WRITE with the
> > current impl), and faults a new page into the page cache?
> IIUC, Qemu creates and owns the memfd that is associated with Guest
> memory. And if it punches a hole in its own memfd that goes through Guest
> pinned pages associated with buffers/resources, then I think the proper
> way to fix this is to somehow notify the Guest driver (virtio-gpu in this
> case) that the backing store of the affected resources is no longer valid
> and that the resources need to be destroyed and re-created again.
> 
> Having said that, one option I could think of is to probably install a
> mmu_notifier associated with the relevant pages in udmabuf, and once we
> get notified about any invalidation event concerning any of the pages,
> we'd fail any subsequent attempt to access these pages and propagate the
> error across the stack.

Sounds right; maybe it needs to go back to the old GUP solution, though,
as mmu notifiers are also mm-based, not fd-based.  Or to be explicit, I
think it'll be pin_user_pages(FOLL_LONGTERM) with the new API.  It'll also
solve the movable pages issue on pinning.  A rough sketch of what the
notifier shape would be is appended below the sig, for reference.

> However, it feels like udmabuf is not the right place to handle this
> issue, because there are very limited options for taking proper
> corrective action if Qemu decides to punch a hole in Guest memory that
> takes out pages that are pinned.

I'm not familiar with the use case that much, but IMHO it's fine if the
driver relies on proper behavior of the userapp to work.

IIUC the worst case here is that the udmabuf holds some pages that are no
longer the pages of the guest mem, but that only happens with a misbehaving
userspace, so it looks all fine as long as they can at least be properly
released when releasing the udmabuf.

It'll be better if the udmabuf can fail hard when detecting this, but IMHO
even that can be optional depending on the need, while any corrective
action would be even one step further.

Thanks,

-- 
Peter Xu
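
A rough, untested sketch of the mmu notifier shape discussed above, just
to show why it's mm-based: the notifier gets registered against an mm with
mmu_notifier_register(), not against the memfd.  The udmabuf back-pointer
and udmabuf_mark_invalid() below are made up:

struct udmabuf_notifier {
        struct mmu_notifier mn;         /* mn.ops = &udmabuf_mn_ops */
        struct udmabuf *ubuf;           /* made-up back-pointer */
};

static int udmabuf_invalidate_start(struct mmu_notifier *mn,
                                    const struct mmu_notifier_range *range)
{
        struct udmabuf_notifier *un =
                container_of(mn, struct udmabuf_notifier, mn);

        /*
         * If the invalidated VA range overlaps the pages backing the
         * udmabuf, flag the buffer so that later attach/mmap attempts
         * fail instead of touching stale pages.
         */
        udmabuf_mark_invalid(un->ubuf, range->start, range->end);
        return 0;
}

static const struct mmu_notifier_ops udmabuf_mn_ops = {
        .invalidate_range_start = udmabuf_invalidate_start,
};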