From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: Nikhil Dhama
Cc: Huang Ying, Mel Gorman
Subject: Re: [PATCH v3] mm: pcp: increase pcp->free_count threshold to trigger free_high
In-Reply-To: <20250407105219.55351-1-nikhil.dhama@amd.com> (Nikhil Dhama's message of "Mon, 7 Apr 2025 16:22:19 +0530")
References: <20250407105219.55351-1-nikhil.dhama@amd.com>
Date: Fri, 11 Apr 2025 10:16:54 +0800
Message-ID: <87mscn8msp.fsf@DESKTOP-5N7EMDA>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Hi, Nikhil,

Sorry for the late reply.
Nikhil Dhama writes:

> In the old pcp design, pcp->free_factor was incremented in nr_pcp_free(),
> which is invoked by free_pcppages_bulk(). So it increased free_factor
> by 1 only when we tried to reduce the size of the pcp list or flush for
> high order, and free_high used to trigger only for order > 0 and
> order < costly_order and pcp->free_factor > 0.
>
> For iperf3 I noticed that with the older design in kernel v6.6, the pcp
> list was drained mostly when pcp->count > high (more often when count
> went above 530), and most of the time pcp->free_factor was 0, triggering
> very few high order flushes.
>
> But this changed in the current design, introduced in commit 6ccdcb6d3a74
> ("mm, pcp: reduce detecting time of consecutive high order page freeing"),
> where pcp->free_factor was changed to pcp->free_count to keep track of the
> number of pages freed contiguously. In this design, pcp->free_count is
> incremented on every deallocation, irrespective of whether the pcp list
> was reduced or not. The logic to trigger free_high is that pcp->free_count
> goes above batch (which is 63) and there are two contiguous high order
> page frees without any intervening allocation.

The design was changed because pcp->high can become much higher than it
could before. That made it much harder to trigger free_high, which
caused some performance regressions too.

> With this design, for iperf3, the pcp list is flushed more frequently
> because the free_high heuristic is triggered more often now. I observed
> that the high order pcp list is drained as soon as both count and
> free_count go above 63.
>
> Due to this more aggressive high order flushing, applications doing
> contiguous high order allocations need to go to the global list more
> frequently.
>
> On a 2-node AMD machine with 384 vCPUs on each node, connected via
> Mellanox ConnectX-7, I am seeing a ~30% performance reduction when
> scaling the number of iperf3 client/server pairs from 32 to 64.
>
> Though this new design reduced the time to detect high order flushes,
> for applications which allocate high order pages more frequently it may
> flush the high order list prematurely. This motivates tuning how late
> or early we should flush high order lists.
>
> So, in this patch, we increase the pcp->free_count threshold to
> trigger free_high from "batch" to "batch + pcp->high_min / 2". This new
> threshold keeps high order pages in the pcp list for a longer duration,
> which can help applications that do high order allocations frequently.

IIUC, we restore the original behavior with "batch + pcp->high / 2", as
in my analysis in

https://lore.kernel.org/all/875xjmuiup.fsf@DESKTOP-5N7EMDA/

If you think my analysis is correct, can you add that to the patch
description too? This makes it easier for people to know why the code
looks this way.

> With this patch, performance of iperf3 is restored, and scores for the
> other benchmarks on the same machine are as follows:
>
>                         iperf3  lmbench3         netperf  kbuild
>                                (AF_UNIX) (SCTP_STREAM_MANY)
>                         ------ --------- ----------------- ------
> v6.6  vanilla (base)     100      100         100          100
> v6.12 vanilla             69      113          98.5         98.8
> v6.12 + this patch       100      110.3       100.2         99.3
>
>
> netperf-tcp:
>
>                    6.12                 6.12
>                  vanilla            this_patch
> Hmean  64      732.14 (  0.00%)     730.45 ( -0.23%)
> Hmean 128     1417.46 (  0.00%)    1419.44 (  0.14%)
> Hmean 256     2679.67 (  0.00%)    2676.45 ( -0.12%)
> Hmean 1024    8328.52 (  0.00%)    8339.34 (  0.13%)
> Hmean 2048   12716.98 (  0.00%)   12743.68 (  0.21%)
> Hmean 3312   15787.79 (  0.00%)   15887.25 (  0.63%)
> Hmean 4096   17311.91 (  0.00%)   17332.68 (  0.12%)
> Hmean 8192   20310.73 (  0.00%)   20465.09 (  0.76%)
>
> Fixes: 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing")
>
> Signed-off-by: Nikhil Dhama
> Suggested-by: Huang Ying
> Cc: Andrew Morton
> Cc: Huang Ying
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Mel Gorman
>
> ---
> v1:
> https://lore.kernel.org/linux-mm/20250107091724.35287-1-nikhil.dhama@amd.com/
> v2: https://lore.kernel.org/linux-mm/20250325171915.14384-1-nikhil.dhama@amd.com/
>
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6958333054d..569dcf1f731f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>  	 * stops will be drained from vmstat refresh context.
>  	 */
>  	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> -		free_high = (pcp->free_count >= batch &&
> +		free_high = (pcp->free_count >= (batch + pcp->high_min / 2) &&
>  			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
>  			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
>  			      pcp->count >= READ_ONCE(batch)));

---
Best Regards,
Huang, Ying