Date: Fri, 7 Jul 2023 20:26:52 +0100
From: Matthew Wilcox <willy@infradead.org>
To: David Hildenbrand
Cc: Yin Fengwei, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    yuzhao@google.com, ryan.roberts@arm.com, shy828301@gmail.com,
    akpm@linux-foundation.org
Subject: Re: [RFC PATCH 0/3] support large folio for mlock
References: <20230707165221.4076590-1-fengwei.yin@intel.com>
 <4bb39d6e-a324-0d85-7d44-8e8a37a1cfec@redhat.com>
 <5c9bf622-0866-168f-a1cd-4e4a98322127@redhat.com>
In-Reply-To: <5c9bf622-0866-168f-a1cd-4e4a98322127@redhat.com>

On Fri, Jul 07, 2023 at 09:15:02PM +0200, David Hildenbrand wrote:
> > > Sure, any time we PTE-map a THP we might just say "let's put that on
> > > the deferred split queue" and cross fingers that we can eventually
> > > split it later. (I was recently thinking about that in the context of
> > > the mapcount ...)
> > >
> > > It's all a big mess ...
> >
> > Oh, I agree, there are always going to be circumstances where we realise
> > we've made a bad decision and can't (easily) undo it.  Unless we have a
> > per-page pincount, and I Would Rather Not Do That.
>
> I agree ...
>
> > But we should _try_
> > to do that because it's the right model -- that's what I meant by "Tell
>
> Try to have per-page pincounts? :/ or do you mean, try to split on VMA
> split? I hope the latter (although I'm not sure about performance) :)

Sorry, try to split a folio on VMA split.

> > me why I'm wrong"; what scenarios do we have where a user temporarily
> > mlocks (or mprotects or ...) a range of memory, but wants that memory
> > to be aged in the LRU exactly the same way as the adjacent memory that
> > wasn't mprotected?
>
> Let me throw in a "fun one".
>
> Parent process has a 2 MiB range populated by a THP. fork() a child
> process. Child process mprotects half the VMA.
>
> Should we split the (COW-shared) THP? Or should we COW/unshare in the
> child process (ugh!) during the VMA split?
>
> It all makes my brain hurt.
OK, so this goes back to what I wrote earlier about attempting to choose
what size of folio to allocate on COW:
https://lore.kernel.org/linux-mm/Y%2FU8bQd15aUO97vS@casper.infradead.org/

: the parent had already established
: an appropriate size folio to use for this VMA before calling fork().
: Whether it is the parent or the child causing the COW, it should probably
: inherit that choice and we should default to the same size folio that
: was already found.

You've come up with a usefully different case here.  I think we should
COW the folio at the point of the mprotect().  That will allow the
parent to become the sole owner of the folio once again and ensure that
when the parent modifies the folio, it _doesn't_ have to COW.  (This is
also a rare case, surely.)

> > GUP-pinning is different, and I don't think GUP-pinning should split
> > a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
> > tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
> > instruction that this memory is different from its neighbours.
> >
> > Maybe we end up deciding to split folios on GUP-pin.  That would be
> > regrettable.
>
> That would probably never be accepted, because the ones that heavily rely
> on THP (databases, VMs) typically also end up using a lot of features that
> use (long-term) page pinning. Don't get me started on io_uring with fixed
> buffers.

I do think that something like a long-term pin should split a folio.
Otherwise we're condemning the rest of the folio to be pinned along with
it.  Short-term pins shouldn't split.
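
For concreteness, here is a very rough sketch of what "long-term pins
split, short-term pins don't" might look like somewhere in the GUP path.
The helper name and call site are made up, split_folio() needs the folio
locked and can fail if there are extra references, and none of the
refcount/mapcount interactions discussed above are handled here:

#include <linux/mm.h>
#include <linux/huge_mm.h>

/*
 * Hypothetical sketch only, not an existing gup.c helper: split a large
 * folio before taking a long-term pin so the rest of the folio isn't
 * condemned to stay pinned along with the part we actually care about.
 * Assumes the caller holds the folio lock; split_folio() can fail
 * (e.g. due to extra references), and the caller would have to cope.
 */
static int maybe_split_before_pin(struct folio *folio,
				  unsigned int gup_flags)
{
	/* Short-term pins (O_DIRECT, tcp zero-copy) leave the folio alone. */
	if (!(gup_flags & FOLL_LONGTERM))
		return 0;

	if (!folio_test_large(folio))
		return 0;

	return split_folio(folio);
}

Whether the split should happen eagerly at pin time like this, or be
deferred and hoped for later, is of course exactly the policy question
being argued above.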