From: Alexander Duyck
Date: Mon, 28 Oct 2024 08:30:45 -0700
Subject: Re: [PATCH net-next v23 0/7] Replace page_frag with page_frag_cache (Part-1)
To: Yunsheng Lin
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Shuah Khan, Andrew Morton, Linux-MM
References: <20241028115343.3405838-1-linyunsheng@huawei.com>
In-Reply-To: <20241028115343.3405838-1-linyunsheng@huawei.com>

On Mon, Oct 28, 2024 at 5:00 AM Yunsheng Lin wrote:
>
> This is part 1 of "Replace page_frag with page_frag_cache",
> which mainly contain refactoring and optimization for the
> implementation of page_frag API before the replacing.
>
> As the discussion in [1], it would be better to target net-next
> tree to get more testing as all the callers page_frag API are
> in networking, and the chance of conflicting with MM tree seems
> low as implementation of page_frag API seems quite self-contained.
>
> After [2], there are still two implementations for page frag:
>
> 1. mm/page_alloc.c: net stack seems to be using it in the
>    rx part with 'struct page_frag_cache' and the main API
>    being page_frag_alloc_align().
> 2. net/core/sock.c: net stack seems to be using it in the
>    tx part with 'struct page_frag' and the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unfiy the page frag implementation
> by replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seems matter that much now as pcp
> is also supported for high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the related change is mostly related to networking, so
> targeting the net-next. And will try to replace the rest
> of page_frag in the follow patchset.
>
> After this patchset:
> 1. Unify the page frag implementation by taking the best out of
>    two the existing implementations: we are able to save some space
>    for the 'page_frag_cache' API user, and avoid 'get_page()' for
>    the old 'page_frag' API user.
> 2. Future bugfix and performance can be done in one place, hence
>    improving maintainability of page_frag's implementation.
>
> Kernel Image changing:
>     Linux Kernel   total |      text      data        bss
>     ------------------------------------------------------
>     after       45250307 |  27274279  17209996     766032
>     before      45254134 |  27278118  17209984     766032
>     delta          -3827 |     -3839       +12         +0
>
> Performance validation:
> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>    non-aligned API performance impact for the existing users, there
>    is no notiable performance degradation. Instead we seems to have
>    some major performance boot for both aligned and non-aligned API
>    after switching to ptr_ring for testing, respectively about 200%
>    and 10% improvement in arm64 server as below.
>
> 2. Use the below netcat test case, we also have some minor
>    performance boot for replacing 'page_frag' with 'page_frag_cache'
>    after this patchset.
>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> In order to avoid performance noise as much as possible, the testing
> is done in system without any other load and have enough iterations to
> prove the data is stable enough, complete log for testing is below:
>
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
> taskset -c 32 nc -l -k 1234 > /dev/null
> perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> *After* this patchset:
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>       17.758393  task-clock (msec) #    0.004 CPUs utilized   ( +-  0.51% )
>               5  context-switches  #    0.293 K/sec           ( +-  0.65% )
>               0  cpu-migrations    #    0.008 K/sec           ( +- 17.21% )
>              74  page-faults       #    0.004 M/sec           ( +-  0.12% )
>        46128650  cycles            #    2.598 GHz             ( +-  0.51% )
>        60810511  instructions      #     1.32 insn per cycle  ( +-  0.04% )
>        14764914  branches          #  831.433 M/sec           ( +-  0.04% )
>           19281  branch-misses     #    0.13% of all branches ( +-  0.13% )
>
>     4.240273854 seconds time elapsed                          ( +-  0.13% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>       17.348690  task-clock (msec) #    0.019 CPUs utilized   ( +-  0.66% )
>               5  context-switches  #    0.310 K/sec           ( +-  0.84% )
>               0  cpu-migrations    #    0.009 K/sec           ( +- 16.55% )
>              74  page-faults       #    0.004 M/sec           ( +-  0.11% )
>        45065287  cycles            #    2.598 GHz             ( +-  0.66% )
>        60755389  instructions      #     1.35 insn per cycle  ( +-  0.05% )
>        14747865  branches          #  850.085 M/sec           ( +-  0.05% )
>           19272  branch-misses     #    0.13% of all branches ( +-  0.13% )
>
>     0.935251375 seconds time elapsed                          ( +-  0.07% )
>
>  Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
>
>    16626.042731  task-clock (msec) #    0.607 CPUs utilized   ( +-  0.03% )
>         3291020  context-switches  #    0.198 M/sec           ( +-  0.05% )
>               1  cpu-migrations    #    0.000 K/sec           ( +-  0.50% )
>              85  page-faults       #    0.005 K/sec           ( +-  0.16% )
>     30581044838  cycles            #    1.839 GHz             ( +-  0.05% )
>     34962744631  instructions      #     1.14 insn per cycle  ( +-  0.01% )
>      6483883671  branches          #  389.984 M/sec           ( +-  0.02% )
>        99624551  branch-misses     #    1.54% of all branches ( +-  0.17% )
>
>    27.370305077 seconds time elapsed                          ( +-  0.01% )
>
>
> *Before* this patchset:
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>       21.587934  task-clock (msec) #    0.005 CPUs utilized   ( +-  0.72% )
>               6  context-switches  #    0.281 K/sec           ( +-  0.28% )
>               1  cpu-migrations    #    0.047 K/sec           ( +-  0.50% )
>              73  page-faults       #    0.003 M/sec           ( +-  0.12% )
>        56080697  cycles            #    2.598 GHz             ( +-  0.72% )
>        61605150  instructions      #     1.10 insn per cycle  ( +-  0.05% )
>        14950196  branches          #  692.526 M/sec           ( +-  0.05% )
>           19410  branch-misses     #    0.13% of all branches ( +-  0.18% )
>
>     4.603530546 seconds time elapsed                          ( +-  0.11% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>       20.988297  task-clock (msec) #    0.006 CPUs utilized   ( +-  0.81% )
>               7  context-switches  #    0.316 K/sec           ( +-  0.54% )
>               1  cpu-migrations    #    0.048 K/sec           ( +-  0.70% )
>              73  page-faults       #    0.003 M/sec           ( +-  0.11% )
>        54512166  cycles            #    2.597 GHz             ( +-  0.81% )
>        61440941  instructions      #     1.13 insn per cycle  ( +-  0.08% )
>        14906043  branches          #  710.207 M/sec           ( +-  0.08% )
>           19927  branch-misses     #    0.13% of all branches ( +-  0.17% )
>
>     3.438041238 seconds time elapsed                          ( +-  1.11% )
>
>  Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
>
>    17364.040855  task-clock (msec) #    0.624 CPUs utilized   ( +-  0.02% )
>         3340375  context-switches  #    0.192 M/sec           ( +-  0.06% )
>               1  cpu-migrations    #    0.000 K/sec
>              85  page-faults       #    0.005 K/sec           ( +-  0.15% )
>     32077623335  cycles            #    1.847 GHz             ( +-  0.03% )
>     35121047596  instructions      #     1.09 insn per cycle  ( +-  0.01% )
>      6519872824  branches          #  375.481 M/sec           ( +-  0.02% )
>       101877022  branch-misses     #    1.56% of all branches ( +-  0.14% )
>
>    27.842745343 seconds time elapsed                          ( +-  0.02% )
>

Are these actually the numbers for this patch set? It seems like you
have been using the same numbers for the last several releases. I can
understand the "before" being mostly the same, but since we have
factored out the refactor portion of it, the numbers for the "after"
should have deviated; I find it highly unlikely that they are exactly
the same, down to the nanosecond, as in the previous patch set.

Also it wouldn't hurt to have an explanation for the 3.4->0.9 second
performance change, as the samples don't seem to match up with the
elapsed time data.
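
To make the mismatch concrete, here is a rough sketch of the arithmetic
using only the test_align=1 figures from the quoted perf stat output.
The names (before, after, pct_change) are illustrative and not part of
the patch set or of perf itself:

```python
# Aligned-case (test_align=1) figures copied from the quoted perf stat output.
# Each tuple is (task-clock in msec, elapsed time in seconds).
before = (20.988297, 3.438041238)
after = (17.348690, 0.935251375)


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means lower)."""
    return (new - old) / old * 100.0


task_clock_delta = pct_change(before[0], after[0])
elapsed_delta = pct_change(before[1], after[1])

# Compare how much the sampled CPU time moved with how much wall time moved.
print(f"task-clock: {task_clock_delta:+.1f}%")
print(f"elapsed:    {elapsed_delta:+.1f}%")
```

With these inputs the sampled task-clock drops by roughly 17% while the
elapsed time drops by roughly 73%, which is the gap that would benefit
from an explanation.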