From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <8d594a01-3e39-4347-9d4e-4db91845f6a3@kernel.org>
Date: Mon, 20 May 2024 16:34:54 +0200
Subject: Re: [LSFMM] automating measuring memory fragmentation
To: Karim Manaouil, Yu Zhao
Cc: Luis Chamberlain, David Bueso, Michal Hocko, Dan Williams,
 John Hubbard, Daniel Gomez, linux-mm, lsf-pc@lists.linux-foundation.org
From: "Vlastimil Babka (SUSE)"

On 5/16/24 11:32 PM, Karim Manaouil wrote:
> On Thu, May 16, 2024 at 02:05:24PM -0600, Yu Zhao wrote:
>> For example, if we have two systems, one has lower fragmentation for
>> some orders but higher fragmentation for the rest, and the other is
>> the opposite. How would we be able to use a single measure to describe
>> this? IOW, I don't think a single measurement can describe all orders
>> in a comparable way, which would be the weakest requirement we would
>> have to impose.
>
>> As I (badly) explained earlier, a single value can't do that because
>> different orders are not on the same footing (or so to speak), unless
>> we are only interested in one non-zero order. So we would need
>> fragmentation_index[NR_non_zero_orders].
>
>> No, for example, A can allocate 4 order-1 but 0 order-2, and B can
>> allocate 2 order-1 *or* 1 order-2, which one would you say is better
>> or worse? This, IMO, depends on which order you are trying to
>> allocate. Does it make sense?
>
> But higher order pages can always be broken down into lower order pages.
> However, the inverse is not always guaranteed (they may not be buddies,
> or compaction/reclaim isn't helpful).
>
> Obviously, I would rather have one order-4 page than two order-3 pages.
> You can always satisfy an allocation for an order n if a page with an
> order higher than n is available.
>
> One way to measure fragmentation is to compare how far we are from some
> perfect value. The perfect value represents the case when all the free
> memory is available as blocks of pageblock_order or MAX_PAGE_ORDER.
>
> I can do this as a one shot calculation, for example with
>
> static void estimate_numa_fragmentation(void)
> {
> 	pg_data_t *pgdat;
> 	struct zone *z;
> 	unsigned long fragscore;
> 	unsigned long bestscore;
> 	unsigned long nr_free;
> 	int order;
>
> 	for_each_online_pgdat(pgdat) {
> 		nr_free = fragscore = 0;
> 		z = pgdat->node_zones;
> 		while (z < (pgdat->node_zones + pgdat->nr_zones)) {
> 			if (!populated_zone(z)) {
> 				z++;
> 				continue;
> 			}
> 			spin_lock_irq(&z->lock);
> 			for (order = 0; order < NR_PAGE_ORDERS; order++) {
> 				nr_free += z->free_area[order].nr_free << order;
> 				fragscore += z->free_area[order].nr_free << (order * 2);
> 			}
> 			spin_unlock_irq(&z->lock);
> 			z++;
> 			cond_resched();
> 		}
> 		bestscore = nr_free << MAX_PAGE_ORDER;
> 		fragscore = ((bestscore - fragscore) * 100) / bestscore;
> 		pr_info("fragscore on node %d: %lu\n", pgdat->node_id, fragscore);
> 	}
> }

I've had a similar idea, which may be exactly the same as yours; I don't
immediately see whether that's the case, and I have not formalized mine,
I just have an intuitive explanation. But the basic premise is the same:
if "continuity" is between 0 and 100, then all memory being available in
MAX_PAGE_ORDER blocks should score 100. Then, e.g., if we have 50% of the
memory in MAX_PAGE_ORDER blocks, having the other 50% fully available in
MAX_PAGE_ORDER-1 blocks should give us another 25%, for a total of 75%,
and so on. Maybe the decay of the contribution to the continuity metric
when decreasing the order by 1 should be less than 2x, not sure.

In the case of Yu Zhao's example, "A can allocate 4 order-1 but 0
order-2, and B can allocate 2 order-1 *or* 1 order-2", I think this
should result in the same score at least? And of course the metric is
not useful if we don't know what allocations we actually need, but it
may be useful enough to compare two systems.

> But there must be a way to streamline the calculation and update the value
> with low overhead over time.

I think that instead of trying to track anything in the kernel and
setting it in stone, the metric should be calculated in userspace after
reading the base values from /proc/buddyinfo (not fragindex). It could
ignore small special zones like ZONE_DMA, and the per-order numbers of
the other zones could be summed together. It could also differentiate
per migratetype using /proc/pagetypeinfo, but then it's no longer a
single number.

> Cheers
> Karim
> PhD Student
> Edinburgh University
>
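
A minimal userspace sketch of the /proc/buddyinfo approach described
above, not a definitive implementation. It assumes each buddyinfo line
has the form "Node N, zone NAME n0 n1 ..." with one free-block count per
order, that the last column corresponds to MAX_PAGE_ORDER, and that only
the zone literally named "DMA" is skipped. The function names and the
MAX_ORDERS bound are hypothetical, chosen for illustration. Each free
page in an order-o block is weighted by 2^(o - max_order), which gives
the "50% + 25% = 75%" behaviour and, under these assumptions, should
equal 100 minus the quoted fragscore.

```c
#include <stdio.h>
#include <string.h>

#define MAX_ORDERS 16	/* assumption: upper bound on columns per line */

/*
 * Continuity in percent for per-order free block counts. All free memory
 * in max-order blocks scores 100; each order below halves the weight.
 */
double continuity_score(const unsigned long *nr_free, int nr_orders)
{
	unsigned long long pages = 0, weighted = 0;
	int max_order = nr_orders - 1;
	int order;

	for (order = 0; order < nr_orders; order++) {
		pages += (unsigned long long)nr_free[order] << order;
		weighted += (unsigned long long)nr_free[order] << (2 * order);
	}
	if (!pages)
		return 0.0;	/* no free memory: avoid dividing by zero */
	return 100.0 * (double)weighted /
	       ((double)pages * (double)(1ULL << max_order));
}

/*
 * Parse one buddyinfo-style line and add its counts to sum[]. Returns
 * the number of orders parsed, or 0 for a malformed line or the skipped
 * "DMA" zone.
 */
int add_buddyinfo_line(const char *line, unsigned long *sum)
{
	char zone[32];
	int node, off, n, order = 0;
	unsigned long count;

	if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &off) != 2)
		return 0;
	if (!strcmp(zone, "DMA"))
		return 0;
	line += off;
	while (order < MAX_ORDERS && sscanf(line, "%lu%n", &count, &n) == 1) {
		sum[order++] += count;
		line += n;
	}
	return order;
}

/* Sum all non-skipped zones read from fp and return one score. */
double buddyinfo_score(FILE *fp)
{
	unsigned long sum[MAX_ORDERS] = { 0 };
	char line[512];
	int nr_orders = 0, n;

	while (fgets(line, sizeof(line), fp)) {
		n = add_buddyinfo_line(line, sum);
		if (n > nr_orders)
			nr_orders = n;
	}
	return continuity_score(sum, nr_orders);
}
```

A caller would pass the result of fopen("/proc/buddyinfo", "r") to
buddyinfo_score() and print the returned value. On one reading of Yu
Zhao's example (A has four free order-1 blocks, B has one free order-2
block), the unnormalized weighted sum is 16 for both. The percentages
differ only because A has 8 free pages and B has 4.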