From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 7 Feb 2023 16:51:40 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Zi Yan
Cc: linux-mm@kvack.org, Vishal Moola, Hugh Dickins, Rik van Riel,
	David Hildenbrand, "Yin, Fengwei"
Subject: Re: Folio mapcount
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
On Tue, Feb 07, 2023 at 11:23:31AM -0500, Zi Yan wrote:
> On 24 Jan 2023, at 13:13, Matthew Wilcox wrote:
>
> > Once we get to the part of the folio journey where we have
> > one-pointer-per-page, we can't afford to maintain per-page state.
> > Currently we maintain a per-page mapcount, and that will have to go.
> > We can maintain extra state for a multi-page folio, but it has to be
> > a constant amount of extra state no matter how many pages are in the
> > folio.
> >
> > My proposal is that we maintain a single mapcount per folio, and its
> > definition is the number of (vma, page table) tuples which have a
> > reference to any pages in this folio.
>
> How about having two, full_folio_mapcount and partial_folio_mapcount?
> If partial_folio_mapcount is 0, we can have a fast path without doing
> anything at page level.

A fast path for what?  I don't understand your vision; can you spell it
out for me?  My current proposal is here:
https://lore.kernel.org/linux-mm/Y+FkV4fBxHlp6FTH@casper.infradead.org/

The three questions we need to be able to answer (in my current
understanding) are laid out here:
https://lore.kernel.org/linux-mm/Y+HblAN5bM1uYD2f@casper.infradead.org/

Of course, the vision also needs to include how we account in
folio_add_(anon|file|new_anon)_rmap() and folio_remove_rmap().

> > I think there's a good performance win and simplification to be had
> > here, so I think it's worth doing for 6.4.
> >
> > Examples
> > --------
> >
> > In the simple and common case where every page in a folio is mapped
> > once by a single vma and single page table, mapcount would be 1 [1].
> > If the folio is mapped across a page table boundary by a single VMA,
> > after we take a page fault on it in one page table, it gets a
> > mapcount of 1.  After taking a page fault on it in the other page
> > table, its mapcount increases to 2.
> >
> > For a PMD-sized THP naturally aligned, mapcount is 1.  Splitting the
> > PMD into PTEs would not change the mapcount; the folio remains
> > order-9 but it still has a reference from only one page table (a
> > different page table, but still just one).
> >
> > Implementation sketch
> > ---------------------
> >
> > When we take a page fault, we can/should map every page in the folio
> > that fits in this VMA and this page table.  We do this at present in
> > filemap_map_pages() by looping over each page in the folio and
> > calling do_set_pte() on each.
> > We should have a:
> >
> > do_set_pte_range(vmf, folio, addr, first_page, n);
> >
> > and then change the API to page_add_new_anon_rmap() /
> > page_add_file_rmap() to pass in (folio, first, n) instead of page.
> > That gives us one call to page_add_*_rmap() per (vma, page table)
> > tuple.
> >
> > In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> > each pfn.  We'll want a function like
> > page_vma_mapped_walk_skip_to_end_of_ptable()
> > in order to persuade it to only call us once or twice if the folio
> > is mapped across a page table boundary.
> >
> > Concerns
> > --------
> >
> > We'll have to be careful to always zap all the PTEs for a given
> > (vma, pt) tuple at the same time, otherwise mapcount will get out of
> > sync (eg map three pages, unmap two; we shouldn't decrement the
> > mapcount, but I don't think we can know that).  But does this ever
> > happen?  I think we always unmap the entire folio, like in
> > try_to_unmap_one().
> >
> > I haven't got my head around SetPageAnonExclusive() yet.  I think it
> > can be a per-folio bit, but handling a folio split across two page
> > tables may be tricky.
> >
> > Notes
> > -----
> >
> > [1] Ignoring the bias by -1 to let us detect transitions that we
> > care about more efficiently; I'm talking about the value returned
> > from page_mapcount(), not the value stored in page->_mapcount.
>
> --
> Best Regards,
> Yan, Zi