Message-ID: <8698ba1f-fc5d-a82e-842b-100dc8957f2f@redhat.com>
Date: Fri, 8 Sep 2023 18:48:05 +0200
To: Christoph Hellwig
Cc: Jan Kara, David Howells, Peter Xu, Lei Huang, miklos@szeredi.hu,
 Xiubo Li, Ilya Dryomov, Jeff Layton, Trond Myklebust, Anna Schumaker,
 Latchesar Ionkov, Dominique Martinet, Christian Schoenebeck,
 "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
 John Fastabend, Jakub Sitnicki, Boris Pismenny,
 linux-nfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
 ceph-devel@vger.kernel.org, linux-mm@kvack.org, v9fs@lists.linux.dev,
 netdev@vger.kernel.org
References: <20230905141604.GA27370@lst.de>
 <0240468f-3cc5-157b-9b10-f0cd7979daf0@redhat.com>
 <20230908081544.GB8240@lst.de>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: getting rid of the last memory modifitions through gup(FOLL_GET)
In-Reply-To: <20230908081544.GB8240@lst.de>

On 08.09.23 10:15, Christoph Hellwig wrote:
> On Wed, Sep 06, 2023 at 11:42:33AM +0200, David Hildenbrand wrote:
>>> and iov_iter_get_pages_alloc2.  We have three file system direct I/O
>>> users of those left: ceph, fuse and nfs.  Lei Huang has sent patches
>>> to convert fuse to iov_iter_extract_pages which I'd love to see merged,
>>> and we'd need equivalent work for ceph and nfs.
>>>
>>> The non-file system uses are in the vmsplice code, which only reads
>>
>> vmsplice really has to be fixed to specify FOLL_PIN|FOLL_LONGTERM for good;
>> I recall that David Howells had patches for that at one point. (at least to
>> use FOLL_PIN)
>
> Hmm, unless I'm misreading the code vmsplice is only using
> iov_iter_get_pages2 for reading from the user address space anyway.
> Or am I missing something?

It's not relevant for the case you're describing here ("last memory
modifitions through gup(FOLL_GET)").

vmsplice_to_pipe() -> iter_to_pipe() -> iov_iter_get_pages2()

So it ends up calling get_user_pages_fast() ... and not using
FOLL_PIN|FOLL_LONGTERM.

Why FOLL_LONGTERM?
Because it's a longterm pin, where unprivileged users can grab a
reference on a page for all eternity, breaking CMA and memory hotunplug
(well, and harming compaction).

Why FOLL_PIN? Well, FOLL_LONGTERM only applies to FOLL_PIN. But for
anonymous memory, this will also take care of the last remaining hugetlb
COW test (trigger COW unsharing) as commented back in:

https://lore.kernel.org/all/02063032-61e7-e1e5-cd51-a50337405159@redhat.com/

>
>>> After that we might have to do an audit of the raw get_user_pages APIs,
>>> but there probably aren't many that modify file backed memory.
>>
>> ptrace should apply as well: it ends up doing a FOLL_GET|FOLL_WRITE.
>
> Yes, if that ends up on file backed shared mappings we also need a pin.

See below.

>
>> Further, KVM ends up using FOLL_GET|FOLL_WRITE to populate the second-level
>> page tables for VMs, and uses MMU notifiers to synchronize the second-level
>> page tables with process page table changes. So once a PTE goes from
>> writable -> r/o in the process page table, the second level page tables for
>> the VM will get updated. Such MMU users are quite different from ordinary
>> GUP users.
>
> Can KVM page tables use file backed shared mappings?

Yes, usually shmem and hugetlb. But with things like emulated
NVDIMMs/virtio-pmem for VMs, easily also ordinary files.

But it's really not ordinary write access through GUP. It's write access
via a secondary page table (secondary MMU) that's synchronized to the
process page table -- just as if the CPU were writing to the page
using the process page tables (primary MMU).

>
>> Converting ptrace might not be desired/required as well (the reference is
>> dropped immediately after the read/write access).
>
> But the pin is needed to make sure the file system can account for
> dirtying the pages.  Something we fundamentally can't do with get.

ptrace will find the pagecache page writable in the page table (PTE
write bit set), if it intends to write to the page (FOLL_WRITE).
If it is not writable, it will trigger a page fault that informs the
file system.

With an FS that wants writenotify, we will not map a page writable (PTE
write bit not set) unless it is dirty (PTE dirty bit set) IIRC.

So are we concerned about a race between the filesystem removing the PTE
write bit (to catch the next write access before it gets dirtied again)
and ptrace marking the page dirty?

It's a very, very small race window, staring at __access_remote_vm().
But it should apply if that's the concern.

>
>> The end goal as discussed a couple of times would be to limit FOLL_GET
>> in general only to a couple of users that can be audited and keep using it
>> for a good reason. Arbitrary drivers that perform DMA should stop using it
>> (and ideally be prevented from using it) and switch to FOLL_PIN.
>
> Agreed, that's where I'd like to get to.  Preferably with the non-pin
> API not even being exported to modules.

Yes. However, secondary MMU users (like KVM) would need some way to keep
making use of that; ideally, using a proper separate interface instead
of (ab)using plain GUP and confusing people :)

[1] https://lkml.org/lkml/2023/1/24/451

-- 
Cheers,

David / dhildenb
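
To make the race between the filesystem removing the PTE write bit and
ptrace marking the page dirty a bit more concrete, here is a minimal
userspace sketch. It is a toy model under stated assumptions, not kernel
code: struct toy_page and every toy_* function are made up for
illustration. They only mimic the PTE write/dirty bits, the writenotify
fault, and a FOLL_GET-style reference that is used after the lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
struct toy_page {
	bool pte_write;   /* PTE write bit in the process page table */
	bool pte_dirty;   /* PTE dirty bit */
	bool page_dirty;  /* dirty flag of the pagecache page */
	int refcount;     /* plain references, as FOLL_GET would take */
	int fs_notified;  /* write faults the filesystem was told about */
	int unnotified;   /* writes that dirtied a write-protected page */
};

/* Write fault on a writenotify mapping: tell the FS, map writable and dirty. */
static void toy_write_fault(struct toy_page *p)
{
	p->fs_notified++;
	p->pte_write = true;
	p->pte_dirty = true;
	p->page_dirty = true;
}

/* Writeback: the FS cleans the page and removes the PTE write bit. */
static void toy_writeback(struct toy_page *p)
{
	p->pte_write = false;
	p->pte_dirty = false;
	p->page_dirty = false;
}

/* FOLL_GET|FOLL_WRITE style lookup: fault if not writable, then take a ref. */
static void toy_gup_get_write(struct toy_page *p)
{
	if (!p->pte_write)
		toy_write_fault(p);
	p->refcount++;
}

/* Write through the reference and dirty the page without rechecking the PTE. */
static void toy_write_via_ref(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (!p->pte_write)
		p->unnotified++;  /* the FS expected a fault before this write */
	p->page_dirty = true;
}

static void toy_put(struct toy_page *p)
{
	p->refcount--;
}

/* Returns how many writes dirtied the page behind the filesystem's back. */
int toy_scenario(bool writeback_in_window)
{
	struct toy_page p = {0};

	toy_gup_get_write(&p);
	if (writeback_in_window)
		toy_writeback(&p);  /* lands between lookup and write */
	toy_write_via_ref(&p);
	toy_put(&p);
	return p.unnotified;
}
```

In this model toy_scenario(false) returns 0 and toy_scenario(true)
returns 1: only when writeback lands in the window between the lookup and
the write does the page get dirtied without a preceding notification. The
model says nothing about how wide that window is in practice.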