From: Uladzislau Rezki <urezki@gmail.com>
Date: Fri, 30 Aug 2024 18:26:46 +0200
To: Adrian Huang
Cc: Andrew Morton, Christoph Hellwig, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Adrian Huang
Subject: Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
References: <20240829130633.2184-1-ahuang12@lenovo.com>
On Thu, Aug 29, 2024 at 09:00:16PM +0200, Uladzislau Rezki wrote:
> On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> > From: Adrian Huang
> >
> > When running the vmalloc stress test on a 448-core system, the average
> > latency of purge_vmap_node() is observed to be about 2 seconds, measured
> > with the eBPF/bcc 'funclatency.py' tool [1]:
> >
> > # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
> >
> >      usecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 29       |                                        |
> >          4 -> 7          : 19       |                                        |
> >          8 -> 15         : 56       |                                        |
> >         16 -> 31         : 483      |****                                    |
> >         32 -> 63         : 1548     |************                            |
> >         64 -> 127        : 2634     |*********************                   |
> >        128 -> 255        : 2535     |*********************                   |
> >        256 -> 511        : 1776     |**************                          |
> >        512 -> 1023       : 1015     |********                                |
> >       1024 -> 2047       : 573      |****                                    |
> >       2048 -> 4095       : 488      |****                                    |
> >       4096 -> 8191       : 1091     |*********                               |
> >       8192 -> 16383      : 3078     |*************************               |
> >      16384 -> 32767      : 4821     |****************************************|
> >      32768 -> 65535      : 3318     |***************************             |
> >      65536 -> 131071     : 1718     |**************                          |
> >     131072 -> 262143     : 2220     |******************                      |
> >     262144 -> 524287     : 1147     |*********                               |
> >     524288 -> 1048575    : 1179     |*********                               |
> >    1048576 -> 2097151    : 822      |******                                  |
> >    2097152 -> 4194303    : 906      |*******                                 |
> >    4194304 -> 8388607    : 2148     |*****************                       |
> >    8388608 -> 16777215   : 4497     |*************************************   |
> >   16777216 -> 33554431   : 289      |**                                      |
> >
> > avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
> >
> > The worst case is 16-33 seconds, so a soft lockup is triggered [2].
> >
> > [Root Cause]
> > 1) Each purge_list is long. The following shows how many vmap_area
> >    structures are purged per node:
> >
> >    crash> p vmap_nodes
> >    vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
> >    crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
> >      nr_purged = 663070
> >      ...
> >      nr_purged = 821670
> >      nr_purged = 692214
> >      nr_purged = 726808
> >      ...
> >
> > 2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
> >    operation when purging each vmap_area.
> >    However, the iteration covers over
> >    600,000 vmap_area structures (see 'nr_purged' above).
> >
> >    Here is the objdump output:
> >
> >    $ objdump -D vmlinux
> >    ffffffff813e8c80 <purge_vmap_node>:
> >    ...
> >    ffffffff813e8d70: f0 48 29 2d 68 0c bb 02    lock sub %rbp,0x2bb0c68(%rip)
> >    ...
> >
> >    Quote from the "Instruction tables" PDF [3]:
> >
> >      Instructions with a LOCK prefix have a long latency that depends on
> >      cache organization and possibly RAM speed. If there are multiple
> >      processors or cores or direct memory access (DMA) devices, then all
> >      locked instructions will lock a cache line for exclusive access,
> >      which may involve RAM access. A LOCK prefix typically costs more
> >      than a hundred clock cycles, even on single-processor systems.
> >
> > That is why the latency of purge_vmap_node() increases dramatically
> > on a many-core system: one core is busy purging each vmap_area of
> > the *long* purge_list and executing atomic_long_sub() for each
> > vmap_area, while the other cores free vmalloc allocations and execute
> > atomic_long_add_return() in free_vmap_area_noflush().
> >
> > [Solution]
> > Employ a local variable to record the total number of purged pages, and
> > execute atomic_long_sub() once after the traversal of the purge_list is
> > done. The experiment shows a latency improvement of 99%.
> >
> > [Experiment Result]
> > 1) System Configuration: Three servers (with HT enabled) are tested.
> >    * 72-core server:  3rd Gen Intel Xeon Scalable Processor*1
> >    * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
> >    * 448-core server: AMD Zen 4 Processor*2
> >
> > 2) Kernel Config
> >    * CONFIG_KASAN is disabled
> >
> > 3) The data in the "w/o patch" and "w/ patch" columns
> >    * Unit: microseconds (us)
> >    * Each value is the average of 3 measurements
> >
> >    System              w/o patch (us)   w/ patch (us)   Improvement (%)
> >    ---------------     --------------   -------------   ---------------
> >    72-core server            2194             14             99.36%
> >    192-core server         143799           1139             99.21%
> >    448-core server        1992122           6883             99.65%
> >
> > [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> > [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
> > [3] https://www.agner.org/optimize/instruction_tables.pdf
> >
> > Signed-off-by: Adrian Huang
> > ---
> >  mm/vmalloc.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 3f9b6bd707d2..607697c81e60 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
> >  {
> >  	struct vmap_node *vn = container_of(work,
> >  		struct vmap_node, purge_work);
> > +	unsigned long nr_purged_pages = 0;
> >  	struct vmap_area *va, *n_va;
> >  	LIST_HEAD(local_list);
> >
> > @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
> >
> >  		list_del_init(&va->list);
> >
> > -		atomic_long_sub(nr, &vmap_lazy_nr);
> > +		nr_purged_pages += nr;
> >  		vn->nr_purged++;
> >
> >  		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> > @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
> >  		list_add(&va->list, &local_list);
> >  	}
> >
> > +	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> > +
> >  	reclaim_list_global(&local_list);
> >  }
> >
> > --
> > 2.34.1
> >
> I see the point and it looks good to me.
>
> Reviewed-by: Uladzislau Rezki (Sony)
>
> Thank you for improving this.
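The batching idea in the patch can be sketched in plain C11 userspace code (illustrative only; the names are made up and this is not the kernel implementation). The per-iteration variant pays one LOCK-prefixed read-modify-write per element; the batched variant accumulates in a local and issues a single atomic at the end, leaving the shared counter in the same state:

```c
#include <stdatomic.h>

/* One atomic RMW per element: each iteration locks the counter's
 * cache line for exclusive access. */
static long purge_per_iteration(atomic_long *lazy_nr,
				const long *nr_pages, int n)
{
	for (int i = 0; i < n; i++)
		atomic_fetch_sub(lazy_nr, nr_pages[i]);
	return atomic_load(lazy_nr);
}

/* Batched: plain additions into a local variable, then one atomic
 * RMW after the loop -- same final value, one locked instruction. */
static long purge_batched(atomic_long *lazy_nr,
			  const long *nr_pages, int n)
{
	long purged = 0;

	for (int i = 0; i < n; i++)
		purged += nr_pages[i];
	atomic_fetch_sub(lazy_nr, purged);
	return atomic_load(lazy_nr);
}
```

Both forms end with the same counter value; the batched one simply issues one locked instruction instead of n, which is why contention drops so sharply on many-core systems.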
> There is one more spot which I detected
> earlier, it is:
>
> static void free_vmap_area_noflush(struct vmap_area *va)
> {
> 	unsigned long nr_lazy_max = lazy_max_pages();
> 	unsigned long va_start = va->va_start;
> 	unsigned int vn_id = decode_vn_id(va->flags);
> 	struct vmap_node *vn;
> 	unsigned long nr_lazy;
>
> 	if (WARN_ON_ONCE(!list_empty(&va->list)))
> 		return;
>
> 	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
> 		PAGE_SHIFT, &vmap_lazy_nr);
> ...
>
> atomic_long_add_return() might also introduce high contention. We can
> optimize it by splitting it into lighter atomics. Can you check it on
> your 448-core system?
>
I have checked free_vmap_area_noflush() on my hardware. It is a 64-core
system:

  ...
  + 7.84%  5.18%  [kernel]  [k] free_vmap_area_noflush
  + 6.16%  1.61%  [kernel]  [k] free_unref_page
  + 5.57%  1.51%  [kernel]  [k] find_unlink_vmap_area
  ...

  ..
              │ arch_atomic64_add_return():
     23352402 │   mov   %r12,%rdx
              │   lock  xadd %rdx,vmap_lazy_nr
              │ is_vn_id_valid():
  52364447314 │   mov   nr_vmap_nodes,%ecx  <----- the hottest spot, consuming most (99%) of the CPU cycles
              │ arch_atomic64_add_return():
     45547180 │   add   %rdx,%r12
              │ is_vn_id_valid():
  ...

At least in my case, on my HW, I do not see atomic_long_add_return() at
the top when it comes to CPU cycles. The hottest spot is this one instead:

static bool is_vn_id_valid(unsigned int node_id)
{
	if (node_id < nr_vmap_nodes)
		return true;

	return false;
}

i.e. the access to "nr_vmap_nodes", which is read-only and globally
defined:

static __read_mostly unsigned int nr_vmap_nodes = 1;

Any thoughts?

--
Uladzislau Rezki
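P.S. To make the "split into lighter atomics" idea above concrete, here is an illustrative userspace C sketch (the function names and the threshold macro are invented, not kernel code). When the returned sum is only fed into a threshold check, the combined add-and-return can be split into an add whose result is discarded plus a separate plain load; the loaded value may be slightly stale under concurrency, which a purge-trigger heuristic tolerates:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical threshold, standing in for lazy_max_pages(). */
#define LAZY_MAX_PAGES 1000L

/* Combined RMW: the updated value is consumed, so on x86 the compiler
 * has to emit the heavier "lock xadd" form. */
static bool over_threshold_fetch(atomic_long *lazy_nr, long pages)
{
	long nr = atomic_fetch_add(lazy_nr, pages) + pages;

	return nr > LAZY_MAX_PAGES;
}

/* Split form: an add whose result is discarded (typically compiled to
 * a cheaper "lock add"), followed by a separate relaxed load. The
 * observed value may lag behind concurrent updates, which is fine for
 * a heuristic threshold check. */
static bool over_threshold_split(atomic_long *lazy_nr, long pages)
{
	atomic_fetch_add_explicit(lazy_nr, pages, memory_order_relaxed);

	return atomic_load_explicit(lazy_nr, memory_order_relaxed)
			> LAZY_MAX_PAGES;
}
```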