From: Matthew Wilcox <willy@infradead.org>
Date: Fri, 7 Jul 2023 20:06:00 +0100
To: David Hildenbrand
Cc: Yin Fengwei, linux-mm@kvack.org, linux-kernel@vger.kernel.org, yuzhao@google.com, ryan.roberts@arm.com, shy828301@gmail.com, akpm@linux-foundation.org
Subject: Re: [RFC PATCH 0/3] support large folio for mlock
References: <20230707165221.4076590-1-fengwei.yin@intel.com> <4bb39d6e-a324-0d85-7d44-8e8a37a1cfec@redhat.com>
On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
> On 07.07.23 19:26, Matthew Wilcox wrote:
> > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
> > > This series identified the large folio for mlock to two types:
> > >   - The large folio is in VM_LOCKED VMA range
> > >   - The large folio cross VM_LOCKED VMA boundary
> >
> > This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> > has led us astray.  Today when the arguments to mlock() cross a folio
> > boundary, we split the PMD entry but leave the folio intact.  That means
> > that we continue to manage the folio as a single entry on the LRU list.
> > But userspace may have no idea that we're doing this.  It may have made
> > several calls to mmap() 256kB at once, they've all been coalesced into
> > a single VMA and khugepaged has come along behind its back and created
> > a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> > a hint that oops, maybe we shouldn't've done that, we do our utmost to
> > preserve the 2MB folio.
> >
> > I think this whole approach needs rethinking.  IMO, anonymous folios
> > should not cross VMA boundaries.  Tell me why I'm wrong.
>
> I think we touched upon that a couple of times already, and the main issue
> is that while it sounds nice in theory, it's impossible in practice.
>
> THP are supposed to be transparent, that is, we should not let arbitrary
> operations fail.
>
> But nothing stops user space from
>
> (a) mmap'ing a 2 MiB region
> (b) GUP-pinning the whole range
> (c) GUP-pinning the first half
> (d) unpinning the whole range from (a)
> (e) munmap'ing the second half
>
> And that's just one out of many examples I can think of, not even
> considering temporary/speculative references that can prevent a split at
> random points in time -- especially when splitting a VMA.
>
> Sure, any time we PTE-map a THP we might just say "let's put that on the
> deferred split queue" and cross fingers that we can eventually split it
> later.  (I was recently thinking about that in the context of the mapcount
> ...)
>
> It's all a big mess ...

Oh, I agree, there are always going to be circumstances where we realise
we've made a bad decision and can't (easily) undo it.  Unless we have a
per-page pincount, and I Would Rather Not Do That.
But we should _try_ to do that because it's the right model -- that's
what I meant by "Tell me why I'm wrong"; what scenarios do we have where
a user temporarily mlocks (or mprotects or ...) a range of memory, but
wants that memory to be aged in the LRU exactly the same way as the
adjacent memory that wasn't mprotected?

GUP-pinning is different, and I don't think GUP-pinning should split a
folio.  That's a temporary use (not FOLL_LONGTERM), eg we're doing tcp
zero-copy or it's the source/target of O_DIRECT.  That's not an
instruction that this memory is different from its neighbours.  Maybe we
end up deciding to split folios on GUP-pin.  That would be regrettable.