Date: Wed, 22 Feb 2023 03:52:53 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Yang Shi
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: What size anonymous folios should we allocate?

On Tue, Feb 21, 2023 at 03:05:33PM -0800, Yang Shi wrote:
> On Tue, Feb 21, 2023 at 1:49 PM Matthew Wilcox wrote:
> >
> > In a sense this question is premature, because we don't have any
> > code in place to handle folios which are any size but PMD_SIZE or
> > PAGE_SIZE, but let's pretend that code already exists and is just
> > waiting for us to answer this policy question.
> >
> > I'd like to reject three ideas up front: 1. a CONFIG option, 2. a
> > boot option and 3. a sysfs tunable.  It is foolish to expect the
> > distro packager or the sysadmin to be able to make such a decision.
> > The correct decision will depend upon the instantaneous workload of
> > the entire machine, and we'll want different answers for different
> > VMAs.
>
> Yeah, I agree those 3 options should be avoided.  For some
> architectures there are one or more sweet-spot sizes where the
> hardware helps.  For example, ARM64 contiguous PTEs allow up to 16
> consecutive 4K pages to form a single 64K TLB entry instead of 16
> separate 4K entries.  Some implementations may support intermediate
> sizes (for example 8K, 16K and 32K, though this may make the hardware
> design harder), but some may not.  AMD's coalesced PTEs support a
> different size (128K if I remember correctly).  So a multiple of the
> size the hardware supports (64K or 128K) seems like the common ground
> for maximizing the hardware benefit.  Of course, nothing prevents the
> kernel from allocating other orders.

All of this is true (although I think AMD's intermediate size is
actually 32kB, not 128kB), but irrelevant.  Software overhead is FAR
more important than hardware overhead.  If we swap out the wrong page,
or have to run around doing reclaim, that absolutely dwarfs the
performance impact of using small TLB entries.  So we need to strike
the right balance between using larger folios for efficiency and
smaller folios for precisely tracking which pages are still part of
the process's working set.

> Actually I was thinking about the reverse: starting from the biggest
> possible order, for example 2M -> 1M -> ... 64K -> ... 4K.  And the
> page fault path should be able to use the same fallback order.  But
> excessive fallback attempts may be harmful too.

What's your reasoning here?
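Just so we're sure we're talking about the same thing, here's the loop
I think you're proposing, as a userspace sketch (every name in it is
invented for illustration; none of this is code from any tree):

#include <stdbool.h>
#include <stdio.h>

#define MAX_ANON_ORDER	9	/* 2MB with 4K base pages */

/*
 * Stand-in for a high-order folio allocation attempt; in real life
 * this fails more often as fragmentation worsens.  Here it simply
 * succeeds for any order at or below @avail_order.
 */
static bool try_alloc_folio(int order, int avail_order)
{
	return order <= avail_order;
}

/* Try 2M, then 1M, ..., 64K, ..., down to a single 4K page. */
static int anon_fault_order(int avail_order)
{
	int order;

	for (order = MAX_ANON_ORDER; order > 0; order--)
		if (try_alloc_folio(order, avail_order))
			return order;
	return 0;	/* order-0 is the one allocation we can't refuse */
}

int main(void)
{
	/* A fragmented system where nothing above 64K is available. */
	printf("allocated order %d\n", anon_fault_order(4));
	return 0;
}

If that's right, each fault can make up to nine failed allocation
attempts before settling for a single page, which I assume is the
harm you're alluding to.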
> > B. A further modification is that it will require three of the four
> > folios being combined to be on the active list.  If two (or more)
> > of the four folios are inactive, we should leave them alone; either
> > they will remain inactive and eventually be evicted, or they will
> > be activated and become eligible for merging in a future pass of
> > khugepaged.
>
> If we use the fallback policy, we should be able to just leave it to
> reclamation time.  When checking references we could tell which PTEs
> have been accessed, then split if there is significant internal
> fragmentation.

I think that's going to lead to excessive memory usage.  There was
data presented at the last LSF/MM that we already have far too much
memory tied up in THPs for many workloads.

> > C. We add a new wrinkle to the LRU handling code.  When our scan of
> > the active list examines a folio, we look to see how many of the
> > PTEs mapping the folio have been accessed.  If it is fewer than
> > half, and those accesses all fall in either the first or the last
> > half of the folio, we split it.  The active half stays on the
> > active list and the inactive half is moved to the inactive list.
>
> With contiguous PTEs, every PTE still maintains its own access bit
> (but it is implementation-defined; some implementations may just set
> the access bit on one PTE of the contiguous region, per the Arm ARM
> IIUC).  But anyway, this is definitely feasible.

If a CPU doesn't have separate access bits for PTEs, then we should
just not use the contiguous bits.  Knowing which parts of the folio
are unused is more important than using the larger TLB entries.
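For concreteness, the test I have in mind looks something like this
userspace sketch (the per-PTE accessed bitmap and all of the names
are invented; none of this is existing kernel API):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * @accessed: bit i is set if PTE i mapping the folio has its access
 * bit set.  @nr: number of PTEs mapping the folio (a power of two,
 * at most 64 in this toy).
 *
 * Split when fewer than half of the PTEs were accessed AND all of
 * the accesses fall within the same half of the folio.
 */
static bool should_split(uint64_t accessed, unsigned int nr)
{
	uint64_t lo_half = (1ULL << (nr / 2)) - 1;
	unsigned int count = __builtin_popcountll(accessed);

	if (count == 0)
		return false;	/* wholly idle: deactivate it whole */
	if (count >= nr / 2)
		return false;	/* mostly in use: keep it whole */
	/* All accesses in the first half, or all in the last half? */
	return !(accessed & ~lo_half) || !(accessed & lo_half);
}

int main(void)
{
	/* A 64K folio mapped by 16 PTEs; only PTEs 0-2 accessed. */
	printf("%d\n", should_split(0x0007, 16));	/* 1: split */
	/* Accesses straddle both halves: leave it alone. */
	printf("%d\n", should_split(0x8001, 16));	/* 0 */
	return 0;
}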
> > For the third case, in contrast, the parent had already established
> > an appropriate size folio to use for this VMA before calling
> > fork().  Whether it is the parent or the child causing the COW, it
> > should probably inherit that choice, and we should default to the
> > same size of folio that was already found.
>
> Actually this is not what THP does now.  The current THP behavior is
> to split the PMD and then fall back to an order-0 page fault.  For
> smaller orders, we may consider allocating a large folio.

I know it's not what THP does now.  I think that's because the gap
between PMD_SIZE and PAGE_SIZE is too large, so we end up wasting too
much memory.  We also have very crude mechanisms for deciding when to
use THPs.  With the adaptive mechanism I described above, I think it's
time to change that.
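To make that concrete, here's roughly what I'd expect the COW path to
do, again as a userspace sketch (struct folio and cow_target_order()
here are stand-ins I made up; only allocation failure should push us
below the parent's order):

#include <stdbool.h>
#include <stdio.h>

struct folio {
	unsigned int order;	/* stand-in for folio_order() */
};

/* Pretend allocator; pass plenty=false to model memory pressure. */
static bool try_alloc(unsigned int order, bool plenty)
{
	return plenty || order == 0;
}

/*
 * On a COW fault, size the copy like the folio we're copying from,
 * instead of splitting down to order-0 the way the PMD path does
 * today.  Drop to a smaller order only if the allocation fails.
 */
static unsigned int cow_target_order(const struct folio *src, bool plenty)
{
	unsigned int order = src->order;

	while (order > 0 && !try_alloc(order, plenty))
		order--;
	return order;
}

int main(void)
{
	struct folio parent = { .order = 4 };	/* a 64K folio */

	printf("copy at order %u\n", cow_target_order(&parent, true));
	printf("under pressure: order %u\n",
	       cow_target_order(&parent, false));
	return 0;
}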