Date: Thu, 15 Aug 2024 13:21:01 -0400
From: Peter Xu <peterx@redhat.com>
To: Jason Gunthorpe
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Sean Christopherson, Oscar Salvador, Axel Rasmussen,
	linux-arm-kernel@lists.infradead.org, x86@kernel.org, Will Deacon,
	Gavin Shan, Paolo Bonzini, Zi Yan, Andrew Morton, Catalin Marinas,
	Ingo Molnar, Alistair Popple, Borislav Petkov, David Hildenbrand,
	Thomas Gleixner, kvm@vger.kernel.org, Dave Hansen, Alex Williamson,
	Yan Zhao
Subject: Re: [PATCH 09/19] mm: New follow_pfnmap API
References: <20240809160909.1023470-1-peterx@redhat.com>
	<20240809160909.1023470-10-peterx@redhat.com>
	<20240814131954.GK2032816@nvidia.com>
	<20240814221441.GB2032816@nvidia.com>
	<20240815161603.GH2032816@nvidia.com>
In-Reply-To: <20240815161603.GH2032816@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline

On Thu, Aug 15, 2024 at 01:16:03PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2024 at 11:41:39AM -0400, Peter Xu wrote:
> > On Wed, Aug 14, 2024 at 07:14:41PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2024 at 02:24:47PM -0400, Peter Xu wrote:
> > > > On Wed, Aug 14, 2024 at 10:19:54AM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, Aug 09, 2024 at 12:08:59PM -0400, Peter Xu wrote:
> > > > > >
> > > > > > +/**
> > > > > > + * follow_pfnmap_start() - Look up a pfn mapping at a user virtual address
> > > > > > + * @args: Pointer to struct @follow_pfnmap_args
> > > > > > + *
> > > > > > + * The caller needs to setup args->vma and args->address to point to the
> > > > > > + * virtual address as the target of such lookup. On a successful return,
> > > > > > + * the results will be put into other output fields.
> > > > > > + *
> > > > > > + * After the caller finished using the fields, the caller must invoke
> > > > > > + * another follow_pfnmap_end() to proper releases the locks and resources
> > > > > > + * of such look up request.
> > > > > > + *
> > > > > > + * During the start() and end() calls, the results in @args will be valid
> > > > > > + * as proper locks will be held. After the end() is called, all the fields
> > > > > > + * in @follow_pfnmap_args will be invalid to be further accessed.
> > > > > > + *
> > > > > > + * If the PTE maps a refcounted page, callers are responsible to protect
> > > > > > + * against invalidation with MMU notifiers; otherwise access to the PFN at
> > > > > > + * a later point in time can trigger use-after-free.
> > > > > > + *
> > > > > > + * Only IO mappings and raw PFN mappings are allowed.
> > > > >
> > > > > What does this mean? The paragraph before said this can return a
> > > > > refcounted page?
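
(For reference, the calling pattern that the kernel-doc above describes
looks roughly like the sketch below. This is only an illustration: it
assumes follow_pfnmap_start() returns 0 on success, uses the field names
from this patch (vma, address, pfn, pgprot, writable), and
use_pfn_under_lock() is a placeholder for whatever the caller does with
the result.)

	struct follow_pfnmap_args args = {
		.vma = vma,
		.address = address,
	};

	/* The mmap lock must be held for read across start()..end() */
	mmap_read_lock(vma->vm_mm);
	if (follow_pfnmap_start(&args)) {
		mmap_read_unlock(vma->vm_mm);
		return -EINVAL;
	}

	/*
	 * args.pfn, args.pgprot and args.writable are only stable here,
	 * while the locks taken by start() are still held.  Do not stash
	 * the pfn for use after end() without extra synchronization
	 * (e.g. MMU notifiers for refcounted pages).
	 */
	use_pfn_under_lock(args.pfn, args.pgprot, args.writable);

	follow_pfnmap_end(&args);
	mmap_read_unlock(vma->vm_mm);
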
> > > >
> > > > This came from the old follow_pte(), I kept that as I suppose we should
> > > > allow VM_IO | VM_PFNMAP just like before, even if in this case I suppose
> > > > only the pfnmap matters where huge mappings can start to appear.
> > >
> > > If that is the intention it should actively block returning anything
> > > that is vm_normal_page() not check the VM flags, see the other
> > > discussion..
> >
> > The restriction should only be applied to the vma attributes, not a
> > specific pte mapping, IMHO.
> >
> > I mean, the comment was describing "which VMA is allowed to use this
> > function", reflecting that we'll fail at anything !PFNMAP && !IO.
> >
> > It seems legal to have private mappings of them, where vm_normal_page() can
> > return true here for some of the mappings under PFNMAP|IO. IIUC neither the
> > old follow_pte() nor the follow_pfnmap*() API cared much about this part so
> > far.
>
> Why? Either the function only returns PFN map no-struct page things or
> it returns struct page stuff too, in which case why bother to check
> the VMA flags if the caller already has to be correct for struct page
> backed results?
>
> This function is only safe to use under the proper locking, and under
> those rules it doesn't matter at all what the result is..

Do you mean we should drop the PFNMAP|IO check? I haven't checked all the
callers to confirm that none of them rely on !PFNMAP && !IO vmas failing
properly in order to work right, so I assume we should definitely keep the
check. Or I could have totally missed what you're suggesting here..

> > > > > > + * The mmap semaphore
> > > > > > + * should be taken for read, and the mmap semaphore cannot be released
> > > > > > + * before the end() is invoked.
> > > > >
> > > > > This function is not safe for IO mappings and PFNs either, VFIO has a
> > > > > known security issue to call it. That should be emphasised in the
> > > > > comment.
> > > >
> > > > Any elaboration on this? I could have missed that..
> > >
> > > Just because the memory is a PFN or IO doesn't mean it is safe to
> > > access it without a refcount. There are many driver scenarios where
> > > revoking a PFN from mmap needs to be a hard fence that nothing else
> > > has access to that PFN. Otherwise it is a security problem for that
> > > driver.
> >
> > Oh ok, I suppose you meant the VFIO whole thing on "zapping mappings when
> > MMIO is disabled"? If so I get it. More below.
>
> And more..
>
> > > > The user needs to do proper mapping if they need an usable address,
> > > > e.g. generic_access_phys() does ioremap_prot() and recheck the pfn didn't
> > > > change.
> > >
> > > No, you can't take the phys_addr_t outside the start/end region that
> > > explicitly holds the lock protecting it. This is what the comment must
> > > warn against doing.
> >
> > I think the comment has that part covered more or less:
> >
> >  * During the start() and end() calls, the results in @args will be valid
> >  * as proper locks will be held. After the end() is called, all the fields
> >  * in @follow_pfnmap_args will be invalid to be further accessed.
> >
> > Feel free to suggest anything that will make it better.
>
> Be much more specific and scary:
>
>  Any physical address obtained through this API is only valid while
>  the @follow_pfnmap_args. Continuing to use the address after end(),
>  without some other means to synchronize with page table updates
>  will create a security bug.

There are some wording issues here (e.g. we return a PFN, not a physical
address), and one sentence doesn't seem to be complete..
but I think I get the "scary" part of it. How about this, appending the
scary part to the end?

 * During the start() and end() calls, the results in @args will be valid
 * as proper locks will be held. After the end() is called, all the fields
 * in @follow_pfnmap_args will be invalid to be further accessed. Further
 * use of such information after end() may require proper synchronizations
 * by the caller with page table updates, otherwise it can create a
 * security bug.

>
> > For generic_access_phys() as a specific example: I think it is safe to map
> > the pfn even after end().
>
> The map could be safe, but also the memory could be hot unplugged as a
> race. I don't know either way if all arch code is safe for that.

I hope it's ok, or we have had a similar problem with follow_pte() for all
these years.. In any case, this sounds like another thing to be checked
outside the scope of this patch..

> > After the map, it rewalks the pgtable, making sure PFN is still there and
> > valid, and it'll only access it this time before end():
> >
> >         if (write)
> >                 memcpy_toio(maddr + offset, buf, len);
> >         else
> >                 memcpy_fromio(buf, maddr + offset, len);
> >         ret = len;
> >         follow_pfnmap_end(&args);
>
> Yes
>
> > If PFN changed, it properly releases the mapping:
> >
> >         if ((prot != pgprot_val(args.pgprot)) ||
> >             (phys_addr != (args.pfn << PAGE_SHIFT)) ||
> >             (writable != args.writable)) {
> >                 follow_pfnmap_end(&args);
> >                 iounmap(maddr);
> >                 goto retry;
> >         }
> >
> > Then taking the example of VFIO: there's no risk of racing with a
> > concurrent zapping as far as I can see, because otherwise it'll see pfn
> > changed.
>
> VFIO dumps the physical address into the IOMMU and ignores
> zap. Concurrent zap results in a UAF through the IOMMU mapping.

Ohhh, so this is what I was missing.. It works for generic memory only
because VFIO pins the pages upfront, so any form of zapping won't throw
things away, but we can't pin pfnmaps so far.

It sounds like we need MMU notifiers when mapping the IOMMU pgtables, as
long as MMIO regions / P2P are involved. That would make sure that when
the BAR mappings are torn down, the devices see the same view as the
processors.

Thanks,

-- 
Peter Xu
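
(For completeness, the generic_access_phys() pattern referenced above,
assembled from the fragments quoted in this thread, looks roughly like the
sketch below. It is an illustration only, assuming follow_pfnmap_start()
returns 0 on success and that ioremap_prot() is available on the
architecture; generic_access_phys_like() is a stand-in name, not a
verbatim copy of the kernel source.)

	int generic_access_phys_like(struct vm_area_struct *vma,
				     unsigned long addr, void *buf,
				     int len, int write)
	{
		struct follow_pfnmap_args args = { .vma = vma, .address = addr };
		resource_size_t phys_addr;
		unsigned long prot;
		void __iomem *maddr;
		int offset = offset_in_page(addr);
		int ret = -EINVAL;
		bool writable;

	retry:
		/* First walk: snapshot pfn/prot under the locks taken by start() */
		if (follow_pfnmap_start(&args))
			return -EINVAL;
		prot = pgprot_val(args.pgprot);
		phys_addr = (resource_size_t)args.pfn << PAGE_SHIFT;
		writable = args.writable;
		follow_pfnmap_end(&args);

		if (write && !writable)
			return -EINVAL;

		/* Map outside the locks; the pfn may change underneath us */
		maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset), prot);
		if (!maddr)
			return -ENOMEM;

		/* Second walk: only touch the mapping if the pfn is unchanged */
		if (follow_pfnmap_start(&args))
			goto out_unmap;

		if ((prot != pgprot_val(args.pgprot)) ||
		    (phys_addr != (args.pfn << PAGE_SHIFT)) ||
		    (writable != args.writable)) {
			follow_pfnmap_end(&args);
			iounmap(maddr);
			goto retry;
		}

		if (write)
			memcpy_toio(maddr + offset, buf, len);
		else
			memcpy_fromio(buf, maddr + offset, len);
		ret = len;
		follow_pfnmap_end(&args);
	out_unmap:
		iounmap(maddr);

		return ret;
	}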