From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Date: Fri, 7 Jul 2023 21:15:02 +0200
To: Matthew Wilcox
Cc: Yin Fengwei, linux-mm@kvack.org, linux-kernel@vger.kernel.org, yuzhao@google.com, ryan.roberts@arm.com, shy828301@gmail.com, akpm@linux-foundation.org
Subject: Re: [RFC PATCH 0/3] support large folio for mlock
Message-ID: <5c9bf622-0866-168f-a1cd-4e4a98322127@redhat.com>
References: <20230707165221.4076590-1-fengwei.yin@intel.com> <4bb39d6e-a324-0d85-7d44-8e8a37a1cfec@redhat.com>

On 07.07.23 21:06, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
>> On 07.07.23 19:26, Matthew Wilcox wrote:
>>> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>>>> This series identified the large folio for mlock to two types:
>>>> - The large folio is in VM_LOCKED VMA range
>>>> - The large folio cross VM_LOCKED VMA boundary
>>>
>>> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
>>> has led us astray.
>>> Today when the arguments to mlock() cross a folio
>>> boundary, we split the PMD entry but leave the folio intact. That means
>>> that we continue to manage the folio as a single entry on the LRU list.
>>> But userspace may have no idea that we're doing this. It may have made
>>> several calls to mmap() 256kB at once, they've all been coalesced into
>>> a single VMA and khugepaged has come along behind its back and created
>>> a 2MB THP. Now userspace calls mlock() and instead of treating that as
>>> a hint that oops, maybe we shouldn't've done that, we do our utmost to
>>> preserve the 2MB folio.
>>>
>>> I think this whole approach needs rethinking. IMO, anonymous folios
>>> should not cross VMA boundaries. Tell me why I'm wrong.
>>
>> I think we touched upon that a couple of times already, and the main issue
>> is that while it sounds nice in theory, it's impossible in practice.
>>
>> THP are supposed to be transparent, that is, we should not let arbitrary
>> operations fail.
>>
>> But nothing stops user space from
>>
>> (a) mmap'ing a 2 MiB region
>> (b) GUP-pinning the whole range
>> (c) GUP-pinning the first half
>> (d) unpinning the whole range from (a)
>> (e) munmap'ing the second half
>>
>> And that's just one out of many examples I can think of, not even
>> considering temporary/speculative references that can prevent a split at
>> random points in time -- especially when splitting a VMA.
>>
>> Sure, any time we PTE-map a THP we might just say "let's put that on the
>> deferred split queue" and cross fingers that we can eventually split it
>> later. (I was recently thinking about that in the context of the mapcount
>> ...)
>>
>> It's all a big mess ...
>
> Oh, I agree, there are always going to be circumstances where we realise
> we've made a bad decision and can't (easily) undo it. Unless we have a
> per-page pincount, and I Would Rather Not Do That.

I agree ...
> But we should _try_
> to do that because it's the right model -- that's what I meant by "Tell

Try to have per-page pincounts? :/ or do you mean, try to split on VMA
split? I hope the latter (although I'm not sure about performance) :)

> me why I'm wrong"; what scenarios do we have where a user temporarilly
> mlocks (or mprotects or ...) a range of memory, but wants that memory
> to be aged in the LRU exactly the same way as the adjacent memory that
> wasn't mprotected?

Let me throw in a "fun one".

Parent process has a 2 MiB range populated by a THP. fork() a child
process. Child process mprotects half the VMA.

Should we split the (COW-shared) THP? Or should we COW/unshare in the
child process (ugh!) during the VMA split.

It all makes my brain hurt.

> GUP-pinning is different, and I don't think GUP-pinning should split
> a folio. That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> tcp zero-copy or it's the source/target of O_DIRECT. That's not an
> instruction that this memory is different from its neighbours.
>
> Maybe we end up deciding to split folios on GUP-pin. That would be
> regrettable.

That would probably never be accepted, because the ones that heavily
rely on THP (databases, VMs), typically also end up using a lot of
features that use (long-term) page pinning. Don't get me started on
io_uring with fixed buffers.

-- 
Cheers,

David / dhildenb