From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 0AA97C54791
	for <linux-mm@archiver.kernel.org>; Mon, 11 Mar 2024 02:12:53 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 47FAE6B0074; Sun, 10 Mar 2024 22:12:52 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 408D76B0075; Sun, 10 Mar 2024 22:12:52 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 2818B6B0078; Sun, 10 Mar 2024 22:12:52 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 118166B0074
	for <linux-mm@kvack.org>; Sun, 10 Mar 2024 22:12:52 -0400 (EDT)
Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay05.hostedemail.com (Postfix) with ESMTP id 9FCC54043C
	for <linux-mm@kvack.org>; Mon, 11 Mar 2024 02:12:51 +0000 (UTC)
X-FDA: 81883134942.06.D42ECE9
Received: from mail-oa1-f48.google.com (mail-oa1-f48.google.com [209.85.160.48])
	by imf03.hostedemail.com (Postfix) with ESMTP id D90022000C
	for <linux-mm@kvack.org>; Mon, 11 Mar 2024 02:12:49 +0000 (UTC)
Authentication-Results: imf03.hostedemail.com;
	dkim=pass header.d=fromorbit-com.20230601.gappssmtp.com header.s=20230601 header.b=0zzgbvct;
	dmarc=pass (policy=quarantine) header.from=fromorbit.com;
	spf=pass (imf03.hostedemail.com: domain of david@fromorbit.com designates 209.85.160.48 as permitted sender) smtp.mailfrom=david@fromorbit.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1710123170; a=rsa-sha256;
	cv=none;
	b=CjheJb1F2Th5sOWRhn0cwCG3dgkzEZxVbaP3d8ZO1u3sdzySJhEaxYoxnI7NmnO6lsBA/K
	T3K1nRFyTt8RdeCtxj+n68ZNBxCf6rDCt5EUHwHW1OS48KPPJ2VCRrjsEKjdLmBERqib9H
	D9O1pLJ2aDvW0x2RtYDY7EvzMN0hXew=
ARC-Authentication-Results: i=1;
	imf03.hostedemail.com;
	dkim=pass header.d=fromorbit-com.20230601.gappssmtp.com header.s=20230601 header.b=0zzgbvct;
	dmarc=pass (policy=quarantine) header.from=fromorbit.com;
	spf=pass (imf03.hostedemail.com: domain of david@fromorbit.com designates 209.85.160.48 as permitted sender) smtp.mailfrom=david@fromorbit.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1710123170;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=O5X/JUyfTxQT9RfC2ug9/o+GQees7fe0KS52UCEy/Y4=;
	b=c+yK1QqmmjnEGeDVt82/1nzJvPJLzuDmHqAq3MmBngeExdyipuZTNmIHr+5qjQJFNlm5Ja
	PNrKpxLoVs6FXzrS+d2BKAlaxmNg0r18D2DAeyrNxkp7JCij3/C8USXMy4B2sWn61mx+FV
	39Ihq/97YArWfco15N7wqmYVp6yNi+4=
Received: by mail-oa1-f48.google.com with SMTP id 586e51a60fabf-221ffba5c8bso54160fac.1
        for <linux-mm@kvack.org>; Sun, 10 Mar 2024 19:12:49 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=fromorbit-com.20230601.gappssmtp.com; s=20230601; t=1710123169; x=1710727969; darn=kvack.org;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to;
        bh=O5X/JUyfTxQT9RfC2ug9/o+GQees7fe0KS52UCEy/Y4=;
        b=0zzgbvctRtCzhrqNbkX1McoX+gNldUUsR/dGT3NV4DUQWmg0uLel5L9GU0YYmNB1h3
         8heT/mkPzYD/7h2UHQb/LCmFaSV6ZBq51A8xncK3dEFlo8dySm0VSGtfWhqkSTrHJuuZ
         kQX2ijn1TwxQ++kPcGE+INf3edKgf3HIpNKtS7ChWa0vIt5DkFWxA6/33iPbrtu2J9IT
         QK+aM9f0V3BXCn9PTlOx4KUkP2lJtZ9gPrHm850GlB/CzYmfoflogX6PuCaQNG73hPQz
         /WJ2AvZa4575O/h3WuZJj4rDB3WiyVZwykSO6EpWYKZs5OkGK8/AJSCBngGOQUfxLU/m
         76rg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1710123169; x=1710727969;
        h=in-reply-to:content-disposition:mime-version:references:message-id
         :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date
         :message-id:reply-to;
        bh=O5X/JUyfTxQT9RfC2ug9/o+GQees7fe0KS52UCEy/Y4=;
        b=OMRq7Dj1kAnSbgQpqAJHlKXmGNQFaDjWBQJqyHrSVPcQvuhRh0ZVWUPjVpWR8xVZLh
         B99nqlSyRTwb2IbO8Q1iRnfqEg3/s4hd+71ajmwgJBRfhKYaF3elrsaNgESfkjrVPO9d
         4xPWztZjwjL5XdfnHW+wUeiQT0RjUAS4FmxT3K33D2rRItUyVOy2qdcIkaoCYnOoFXEl
         AMSsn0qVS4qWNd/+T3YjHdiuNSo7xWS6YRffkMZwbe3z05o4cx66RPidwIHUEg182Jhm
         NSyiKe+tzx0YZcCpABYLrOb7LE18aB5w3lDDUnZzADt12q4/vTdA/mkEnreWLl3wTbTV
         CUrA==
X-Gm-Message-State: AOJu0Yy/3Xdu7KQveFiJH1tBE9cy9wKmi98DEJCozDhNG6i7wIDOqTiW
	HgCjd68YbFkVO/8xI1sA9Lv0hj9wWcuEJvOqCWEVHzu3KqW+IwsZgE2+XALToSM=
X-Google-Smtp-Source: AGHT+IG4+A1cVfYNN4zWw1iLg4uYer2MBhOZ4XSCvr2zHDKMkHOO0oaFFVTsgoKhh5AZRadLaRhaQg==
X-Received: by 2002:a05:6870:b688:b0:221:95d6:c3d1 with SMTP id cy8-20020a056870b68800b0022195d6c3d1mr6138583oab.37.1710123168605;
        Sun, 10 Mar 2024 19:12:48 -0700 (PDT)
Received: from dread.disaster.area (pa49-179-47-118.pa.nsw.optusnet.com.au. [49.179.47.118])
        by smtp.gmail.com with ESMTPSA id y12-20020a056a00038c00b006e57247f4e5sm3160809pfs.8.2024.03.10.19.12.48
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Sun, 10 Mar 2024 19:12:48 -0700 (PDT)
Received: from dave by dread.disaster.area with local (Exim 4.96)
	(envelope-from <david@fromorbit.com>)
	id 1rjV9h-000G8X-1P;
	Mon, 11 Mar 2024 13:12:45 +1100
Date: Mon, 11 Mar 2024 13:12:45 +1100
From: Dave Chinner <david@fromorbit.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: On the optimum size of a batch
Message-ID: <Ze5onaXsI+LT1+Be@dread.disaster.area>
References: <Zeoble0xJQYEAriE@casper.infradead.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Zeoble0xJQYEAriE@casper.infradead.org>
X-Rspam-User: 
X-Rspamd-Server: rspam06
X-Rspamd-Queue-Id: D90022000C
X-Stat-Signature: x5k3egrytimi8hp3u75sh75k71ymx5kg
X-HE-Tag: 1710123169-543576
X-HE-Meta: U2FsdGVkX1/ZiAdSNsEeJFOlLWvbSD59RCuc3XMED1Ua6lxaiOvXfAVuugY4IZ5qPE9GEW9ith5ceaiyJWGCAJgFhK7ZcbUY18fCTMyGJkOuVbYz/xNMcadKg6RpX+j+YTBD1lc4nZWmevQczPz2MK57xTJIZTRU4wcaV3cwmdzLNQOJGsMZh5SBs10ydQFUBfvSalIgZQOEEnK2r+3OuOokCVuoohV8CVivPXqpFZFXPyxR652uVdXZwhqYtip9WXI+GWYjLjgYEaBtuOvhDZpGZzYMH3Hgp+2sJyGjhmEG3z42tOWsuuUUvC6hMx9rPYVP3ey6FCwKA5mnbfP0/G97Bw7H+FvRN3mX8J94d58NqCMuXdamJjdiggSnCgN22605OnxcGJrK3tFW86FQIaMwsbnaIfF/O8k94OIgxM6sJimdYVc14kkQVBEEjmJHq3IgDNl53GhkzZjEoFkWtU/N2MwsmLBiqqugeZhDlBgXZWJUL+RWk9KYcG/3xt+TNo9u75w4CW9Ie8sILTAWgeCTlJXUm3IpgNdCZYbB2v7jN3Wu15BdKjTFP4qbW3zUtUIsr1RldSXLcifLnGIRuUkBwZ8oqCt+IIjkGZ9Q6S0k/rCUjhq6C5jF/lT0EwhddUIJZxR41Cg3fO4FFL5o2IuCbn62fVDJKofr0ulyQxJREl/UrUP6LucxijZ3yRQfPm58ksTR5XOoW+5Fn2JW6sYwLfsO3nBjahIzu8GSIvrjzdpUzZdz0NBBTCA/wAlIFdqeHAesSKAJEuuR7segSNHvjI24a/q+2oNIsTHHsNpmkyXJjajHR8cvX19r4dFI3H6dP55X/62C+8HqH0jGzhiyVlQXO5PbAKBKZg1Q0QM84d0geJZElOjls5w7aTkH/ZKlbbgZVrJ7ll3D/hzXIh2mc/mOxWt68ZSSHWV7GGS7KEOeNA+rO+Z31v08GZ+QCgAlLgX1daYH9U4ctNz
 KXZwL+U9
 NzvMMxSeaNBZqrVkmg4jC2S31hFQHBxFiwiiVlNpVwmH8hOuU+NeahOeR8M7W6ggLyxO7GMRwZEAAXmbZY90Mq5TFOd3v5jh2/kcqthlIs/FjgVXdF6E1inCBpA==
X-Bogosity: Ham, tests=bogofilter, spamicity=0.002452, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Mar 07, 2024 at 07:55:01PM +0000, Matthew Wilcox wrote:
> I've had a few conversations recently about how many objects should be
> in a batch in some disparate contextx, so I thought I'd write down my
> opinion so I can refer to it in future.  TLDR: Start your batch size
> around 10, adjust the batch size when measurements tell you to change it.
> 
> In this model, let's look at the cost of allocating N objects from an
> allocator.  Assume there's a fixed cost, say 4 (units are not relevant
> here) for going into the allocator and then there's a 1 unit cost per
> object (eg we're taking a spinlock, pulling N objects out of the data
> structure and releasing the spinlock again).
> 
> Allocating 100 * 1 objects would cost 500 units.  Our best case is that
> we could save 396 units by allocating a batch of 100.  But we probably
> don't know how many objects we're going to need to allocate, so we pull
> objects from the allocator in smaller batches.  Here's a table showing
> the costs for different batch sizes:
> 
> Batch size      Cost of allocating 100          thousand        million
> 1               500 (5 * 100)                   5000            5M
> 2               300 (6 * 50)                    3000            3M
> 4               200 (8 * 25)                    2000            2M
> 8               156 (12 * 13)                   1500            1.5M
> 16              140 (20 * 7)                    1260            1.25M
> 32              144 (36 * 4)                    1152            1.13M
> 64              136 (68 * 2)                    1088            1.06M
> 128             132 (132 * 1)                   1056            1.03M

Isn't this just repeating the fundamental observation that SLUB is
based on?  i.e. it can use high-order pages so that it can
pre-allocate optimally sized batches of objects regardless of their
size? i.e.  it tries to size the backing page order to allocate in
chunks of 30-40 objects at a time?

> You can see the knee of this curve is around 8.  It fluctuates a bit after
> that depending on how many "left over" objects we have after allocating
> the 100 it turned out that we needed.  Even if we think we're going to
> be dealing with a _lot_ of objects (the thousand and million column),
> we've got most of the advantage by the time we get to 8 (eg a reduction
> of 3.5M from a total possible reduction of 4M), and while I wouldn't
> sneeze at getting a few more percentage points of overhead reduction,
> we're scrabbling at the margins now, not getting big wins.

Except for SLUB we're actually allocating in the hundreds of
millions to billions of objects on machines with TBs of RAM. IOWs we
really want to be much further down the curve than 8 - batches of at
least 32-64 have significantly lower cost and that matters when
scaling to (and beyond) hundreds of millions of objects....

> This is a simple model for only one situation.  If we have a locking
> contention breakdown, the overhead cost might be much higher than 4 units,
> and that would lead us to a larger batch size.
> 
> Another consideration is how much of each object we have to touch.
> put_pages_list() is frequently called with batches of 500 pages.  In order
> to free a folio, we have to manipulate its contents, so touching at least
> one cacheline per object.

Right, that's simply the cost of the batch cache footprint issue
rather than a "fixed cost mitigation" described for allocation.

So I'm not sure what you're trying to say here? We've known about
these batch optimisation considerations for a long, long time and
that batch size optimisation is always algorithm and access pattern
dependent, so.... ???

> And we make multiple passes over the batch,
> first decrementing the refcount, removing it from the lru list; second
> uncharging the folios from the memcg (writes to folio->memcg_data);
> third calling free_pages_prepare which, eg, sets ->mapping to NULL;
> fourth putting the folio on the pcp list (writing to the list_head).

Sounds like "batch cache footprint" would be reduced by inverting
that algorithm and doing all the work on a single object in a single
pass, rahter than doing it in multiple passes.  That way the cache
footprint of the batch is determined entirely by the size of the
data structures accessed to process each object in the batch.

i.e. if you are going to take an L1 cache miss accessing every
object in the batch anyway, then reducing batch size doesn't improve
overall per-object processing efficiency. All it does is keep the
processing cost down to a single L1 cache miss per object in the
batch. The tradeoff for this is more frequent batch refills, so this
only works is the additional fixed cost for obtaining each batch is
lower than the cost of multiple L1 cache misses per object....

All this says to me is that sometimes the batch size is not actually
the problem that needs fixing - changing the algorithm
and/or processing pipeline to remove the possiblity of repeated
accesses to individual objects in the batch reduces selecting the
batch size down to the same "fixed cost mitigation" case you started
with....

-Dave.
-- 
Dave Chinner
david@fromorbit.com