From: Alexander Duyck
Date: Tue, 10 Dec 2024 07:58:48 -0800
Subject: Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
To: Yunsheng Lin
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Shuah Khan, Andrew Morton, Linux-MM
In-Reply-To: <15723762-7800-4498-845e-7383a88f147b@huawei.com>
References: <20241206122533.3589947-1-linyunsheng@huawei.com> <3de1b8a3-ae4f-492f-969d-bc6f2c145d09@huawei.com> <15723762-7800-4498-845e-7383a88f147b@huawei.com>

On Tue, Dec 10, 2024 at 4:27 AM Yunsheng Lin wrote:
>
> On 2024/12/10 0:03, Alexander Duyck wrote:
>
> ...
>
> >
> > Other than code size have you tried using perf to profile the
> > benchmark before and after. I suspect that would be telling about
> > which code changes are the most likely to be causing the issues.
> > Overall I don't think the size has increased all that much. I suspect
> > most of this is the fact that you are inlining more of the
> > functionality.
>
> It seems the testing result is very sensitive to code changing and
> reorganizing, as using the patch at the end to avoid the problem of
> 'perf stat' not including data from the kernel thread seems to provide
> more reasonable performance data.
>
> It seems the most obvious difference is 'insn per cycle' and I am not
> sure how to interpret the difference of below data for the performance
> degradation yet.
>
> With patch 1:
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
>        5473.815250      task-clock (msec)         #    0.984 CPUs utilized
>                 18      context-switches          #    0.003 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.022 K/sec
>        14210894727      cycles                    #    2.596 GHz                      (92.78%)
>        18903171767      instructions              #    1.33  insn per cycle           (92.82%)
>         2997494420      branches                  #  547.606 M/sec                    (92.84%)
>            7539978      branch-misses             #    0.25% of all branches          (92.84%)
>         6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
>           29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
>           57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
>             347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
>         5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
>             193877      L1-icache-load-misses                                         (92.91%)
>         6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
>             137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
>         5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
>                210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
>                         L1-dcache-prefetches
>                         L1-dcache-prefetch-misses
>
>        5.563068950 seconds time elapsed
>
> Without patch 1:
> root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
>        5306.644600      task-clock (msec)         #    0.984 CPUs utilized
>                 15      context-switches          #    0.003 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.023 K/sec
>        13776872322      cycles                    #    2.596 GHz                      (92.84%)
>        13257649773      instructions              #    0.96  insn per cycle           (92.82%)
>         2446901087      branches                  #  461.101 M/sec                    (92.91%)
>            7172751      branch-misses             #    0.29% of all branches          (92.84%)
>         5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
>           38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
>           65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
>             191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
>         4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
>             175208      L1-icache-load-misses                                         (92.91%)
>         5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
>             140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
>         4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
>                170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
>                         L1-dcache-prefetches
>                         L1-dcache-prefetch-misses
>
>        5.395104330 seconds time elapsed
>
>
> Below is perf data for aligned API without patch 1, as above non-aligned
> API also use test_alloc_len as 12, theoretically the performance data
> should not be better than the non-aligned API as the aligned API will do
> the aligning of fragsz basing on SMP_CACHE_BYTES, but the testing seems
> to show otherwise and I am not sure how to interpret that too:
> perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':
>
>        2447.553100      task-clock (msec)         #    0.965 CPUs utilized
>                  9      context-switches          #    0.004 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.050 K/sec
>         6354149177      cycles                    #    2.596 GHz                      (92.81%)
>         6467793726      instructions              #    1.02  insn per cycle           (92.76%)
>         1120749183      branches                  #  457.906 M/sec                    (92.81%)
>            7370402      branch-misses             #    0.66% of all branches          (92.81%)
>         2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
>           39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
>           42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
>              95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
>         2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
>             118902      L1-icache-load-misses                                         (92.97%)
>         3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
>              81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
>         2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
>                159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
>                         L1-dcache-prefetches
>                         L1-dcache-prefetch-misses
>
>        2.535085780 seconds time elapsed

I'm not sure perf stat will tell us much as it is really too high
level to give us much in the way of details. I would be more
interested in the output from perf record -g followed by a perf
report, or maybe even just a snapshot from perf top while the test is
running. That should show us where the CPU is spending most of its
time and what areas are hot in the before and after graphs.
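
For reference, the 'insn per cycle' figures in the quoted runs follow directly from the raw counters (instructions divided by cycles). The sketch below recomputes them and also normalizes by nr_test. It is only a rough illustration: the names `runs` and `summarize` are hypothetical, and dividing by nr_test assumes the counted work is dominated by the test loop, which the quoted output does not confirm.

```python
# Raw counters copied from the three quoted perf stat runs.
NR_TEST = 51_200_000  # nr_test module parameter used in every run

runs = {
    "with patch 1": {
        "cycles": 14_210_894_727,
        "instructions": 18_903_171_767,
    },
    "without patch 1": {
        "cycles": 13_776_872_322,
        "instructions": 13_257_649_773,
    },
    "aligned, without patch 1": {
        "cycles": 6_354_149_177,
        "instructions": 6_467_793_726,
    },
}


def summarize(counters, nr_test=NR_TEST):
    """Return (insn per cycle, cycles per iteration, insns per iteration).

    The per-iteration values assume (hypothetically) that nearly all
    counted cycles and instructions belong to the test loop itself.
    """
    cycles = counters["cycles"]
    insns = counters["instructions"]
    return insns / cycles, cycles / nr_test, insns / nr_test


if __name__ == "__main__":
    for name, counters in runs.items():
        ipc, cyc_per_iter, insn_per_iter = summarize(counters)
        print(f"{name:26s} IPC={ipc:.2f} "
              f"cycles/iter={cyc_per_iter:.1f} insns/iter={insn_per_iter:.1f}")
```

Under that assumption, the run with patch 1 retires noticeably more instructions per iteration while spending only slightly more cycles, which is consistent with the higher 'insn per cycle' value but does not say where the extra instructions come from. A call-graph profile is still needed for that.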