Date: Tue, 7 Feb 2023 23:27:17 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Peter Xu
Cc: linux-mm@kvack.org, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, "Yin, Fengwei"
Subject: Re: Folio mapcount
On Tue, Feb 07, 2023 at 05:39:07PM -0500, Peter Xu wrote:
> On Mon, Feb 06, 2023 at 08:34:31PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 24, 2023 at 06:13:21PM +0000, Matthew Wilcox wrote:
> > > Once we get to the part of the folio journey where we have
> > > one-pointer-per-page, we can't afford to maintain per-page state.
> > > Currently we maintain a per-page mapcount, and that will have to go.
> > > We can maintain extra state for a multi-page folio, but it has to be a
> > > constant amount of extra state no matter how many pages are in the folio.
> > >
> > > My proposal is that we maintain a single mapcount per folio, and its
> > > definition is the number of (vma, page table) tuples which have a
> > > reference to any pages in this folio.
> >
> > I've been thinking about this a lot more, and I have changed my
> > mind.  It works fine to answer the question "Is any page in this
> > folio mapped", but it's now hard to answer the question "I have it
> > mapped, does anybody else?"  That question is asked, for example,
> > in madvise_cold_or_pageout_pte_range().
>
> I'm curious whether it is still fine in rare cases - IMHO it's a matter of
> when it'll go severely wrong if the mapcount should be exactly 1 (it's
> privately owned by a vma) but we reported 2.
>
> In this MADV_COLD/MADV_PAGEOUT case we'll skip COLD or PAGEOUT some pages
> even if we can, but is it a deal breaker (if the benefit of the change can
> be proved and worthwhile)?  Especially, this only happens with unaligned
> folios being mapped.
>
> Is unaligned mapping for a folio common?  Is there any other use cases that
> can go worse than this one?

For file pages, I think it can go wrong rather more often than we might
like.  I think for anon memory, we'll tend to allocate it to be aligned,
and then it takes some weirdness like mremap() to make it unaligned.
But I'm just waving my hands wildly.  I don't really know.

> (E.g., IIUC superfluous but occasional CoW seems fine)
>
> OTOH...
>
> >
> > With this definition, if the mapcount is 1, it's definitely only mapped
> > by us.  If it's more than 2, it's definitely mapped by somebody else (*).
> > If it's 2, maybe we have the folio mapped twice, and maybe we have it
> > mapped once and somebody else has it mapped once, so we have to consult
> > the rmap to find out.  Not fun times.
> >
> > (*) If we support folios larger than PMD size, then the answer is more
> > complex.
> >
> > I now think the mapcount has to be defined as "How many VMAs have
> > one-or-more pages of this folio mapped".
> >
> > That means that our future folio_add_file_rmap_range() looks a bit
> > like this:
> >
> > {
> > 	bool add_mapcount = true;
> >
> > 	if (nr < folio_nr_pages(folio))
> > 		add_mapcount = !folio_has_ptes(folio, vma);
> > 	if (add_mapcount)
> > 		atomic_inc(&folio->_mapcount);
> >
> > 	__lruvec_stat_mod_folio(folio, NR_FILE_MAPPED, nr);
> > 	if (nr == HPAGE_PMD_NR)
> > 		__lruvec_stat_mod_folio(folio, folio_test_swapbacked(folio) ?
> > 			NR_SHMEM_PMDMAPPED : NR_FILE_PMDMAPPED, nr);
> >
> > 	mlock_vma_folio(folio, vma, nr == HPAGE_PMD_NR);
> > }
> >
> > bool folio_mapped_in_vma(struct folio *folio, struct vm_area_struct *vma)
> > {
> > 	unsigned long address = vma_address(&folio->page, vma);
> > 	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
> >
> > 	if (!page_vma_mapped_walk(&pvmw))
> > 		return false;
> > 	page_vma_mapped_walk_done(&pvmw);
> > 	return true;
> > }
> >
> > ... some details to be fixed here; particularly this will currently
> > deadlock on the PTL, so we'd need not only to exclude the current
> > PMD from being examined, but also avoid a deadly embrace between
> > two threads (do we currently have a locking order defined for
> > page table locks at the same height of the tree?)
>
> ... it starts to sound scary if it needs to take >1 pgtable locks.

I've been thinking about this one, and I wonder if we can do it without
taking any pgtable locks.  The locking environment we're in is the page
fault handler, so we have the mmap_lock for read (for now anyway ...).
We also hold the folio lock, so _if_ the folio is mapped, those entries
can't disappear under us.  They also can't appear under us.  We hold
the PTL on one PMD, but not necessarily on any other PMD we examine.
I appreciate that PTEs can _change_ under us if we do not hold the PTL,
but by virtue of holding the folio lock, they can't change from or to
our PFNs.
I also think the PMD table cannot disappear under us since we're
holding the mmap_lock for read, and anyone removing page tables has to
take the mmap_lock for write.  Am I missing anything important?