From: Uladzislau Rezki <urezki@gmail.com>
Date: Thu, 29 Aug 2024 21:00:16 +0200
To: Adrian Huang
Cc: Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Adrian Huang
Subject: Re: [PATCH 1/1] mm: vmalloc: Optimize vmap_lazy_nr arithmetic when purging each vmap_area
References: <20240829130633.2184-1-ahuang12@lenovo.com>
In-Reply-To: <20240829130633.2184-1-ahuang12@lenovo.com>
On Thu, Aug 29, 2024 at 09:06:33PM +0800, Adrian Huang wrote:
> From: Adrian Huang <ahuang12@lenovo.com>
>
> When running the vmalloc stress test on a 448-core system, the observed
> average latency of purge_vmap_node() is about 2 seconds, as measured
> with the eBPF/bcc 'funclatency.py' tool [1].
>
> # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node & pid1=$! && sleep 8 && modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1
>
>       usecs               : count     distribution
>           0 -> 1          : 0        |                                        |
>           2 -> 3          : 29       |                                        |
>           4 -> 7          : 19       |                                        |
>           8 -> 15         : 56       |                                        |
>          16 -> 31         : 483      |****                                    |
>          32 -> 63         : 1548     |************                            |
>          64 -> 127        : 2634     |*********************                   |
>         128 -> 255        : 2535     |*********************                   |
>         256 -> 511        : 1776     |**************                          |
>         512 -> 1023       : 1015     |********                                |
>        1024 -> 2047       : 573      |****                                    |
>        2048 -> 4095       : 488      |****                                    |
>        4096 -> 8191       : 1091     |*********                               |
>        8192 -> 16383      : 3078     |*************************               |
>       16384 -> 32767      : 4821     |****************************************|
>       32768 -> 65535      : 3318     |***************************             |
>       65536 -> 131071     : 1718     |**************                          |
>      131072 -> 262143     : 2220     |******************                      |
>      262144 -> 524287     : 1147     |*********                               |
>      524288 -> 1048575    : 1179     |*********                               |
>     1048576 -> 2097151    : 822      |******                                  |
>     2097152 -> 4194303    : 906      |*******                                 |
>     4194304 -> 8388607    : 2148     |*****************                       |
>     8388608 -> 16777215   : 4497     |*************************************   |
>    16777216 -> 33554431   : 289      |**                                      |
>
> avg = 2041714 usecs, total: 78381401772 usecs, count: 38390
>
> The worst case is over 16-33 seconds, so a soft lockup is triggered [2].
>
> [Root Cause]
> 1) Each purge_list can be very long. The following shows how many
>    vmap_area objects were purged per node:
>
>      crash> p vmap_nodes
>      vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
>      crash> vmap_node 0xff2de5a900100000 128 | grep nr_purged
>        nr_purged = 663070
>        ...
>        nr_purged = 821670
>        nr_purged = 692214
>        nr_purged = 726808
>        ...
>
> 2) atomic_long_sub() employs the 'lock' prefix to make the operation
>    atomic when purging each vmap_area. However, the iteration covers
>    over 600000 vmap_area objects (see 'nr_purged' above).
>
>    Here is the objdump output:
>
>      $ objdump -D vmlinux
>      ffffffff813e8c80 <purge_vmap_node>:
>      ...
>      ffffffff813e8d70: f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
>      ...
>
> Quote from the "Instruction tables" PDF file [3]:
>
>    Instructions with a LOCK prefix have a long latency that depends on
>    cache organization and possibly RAM speed. If there are multiple
>    processors or cores or direct memory access (DMA) devices, then all
>    locked instructions will lock a cache line for exclusive access,
>    which may involve RAM access. A LOCK prefix typically costs more
>    than a hundred clock cycles, even on single-processor systems.
>
> That's why the latency of purge_vmap_node() dramatically increases on
> a many-core system: one core is busy purging each vmap_area of the
> *long* purge_list and executing atomic_long_sub() for each vmap_area,
> while other cores free vmalloc allocations and execute
> atomic_long_add_return() in free_vmap_area_noflush().
>
> [Solution]
> Employ a local variable to record the total number of purged pages, and
> execute atomic_long_sub() once after the traversal of the purge_list is
> done. The experiment shows the latency improvement is 99%.
>
> [Experiment Result]
> 1) System Configuration: Three servers (with HT enabled) are tested.
>    * 72-core server: 3rd Gen Intel Xeon Scalable Processor * 1
>    * 192-core server: 5th Gen Intel Xeon Scalable Processor * 2
>    * 448-core server: AMD Zen 4 Processor * 2
>
> 2) Kernel Config
>    * CONFIG_KASAN is disabled
>
> 3) The data in columns "w/o patch" and "w/ patch"
>    * Unit: microseconds (us)
>    * Each value is the average of 3 measurements
>
>    System            w/o patch (us)   w/ patch (us)   Improvement (%)
>    ---------------   --------------   -------------   ---------------
>    72-core server          2194             14             99.36%
>    192-core server       143799           1139             99.21%
>    448-core server      1992122           6883             99.65%
>
> [1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
> [2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
> [3] https://www.agner.org/optimize/instruction_tables.pdf
>
> Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
> ---
>  mm/vmalloc.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 3f9b6bd707d2..607697c81e60 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2210,6 +2210,7 @@ static void purge_vmap_node(struct work_struct *work)
>  {
>  	struct vmap_node *vn = container_of(work,
>  		struct vmap_node, purge_work);
> +	unsigned long nr_purged_pages = 0;
>  	struct vmap_area *va, *n_va;
>  	LIST_HEAD(local_list);
>
> @@ -2224,7 +2225,7 @@ static void purge_vmap_node(struct work_struct *work)
>
>  		list_del_init(&va->list);
>
> -		atomic_long_sub(nr, &vmap_lazy_nr);
> +		nr_purged_pages += nr;
>  		vn->nr_purged++;
>
>  		if (is_vn_id_valid(vn_id) && !vn->skip_populate)
> @@ -2235,6 +2236,8 @@ static void purge_vmap_node(struct work_struct *work)
>  		list_add(&va->list, &local_list);
>  	}
>
> +	atomic_long_sub(nr_purged_pages, &vmap_lazy_nr);
> +
>  	reclaim_list_global(&local_list);
> }
>
> --
> 2.34.1
>

I see the point and it looks good to me.

Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

Thank you for improving this.
There is one more spot that I detected earlier:

static void free_vmap_area_noflush(struct vmap_area *va)
{
	unsigned long nr_lazy_max = lazy_max_pages();
	unsigned long va_start = va->va_start;
	unsigned int vn_id = decode_vn_id(va->flags);
	struct vmap_node *vn;
	unsigned long nr_lazy;

	if (WARN_ON_ONCE(!list_empty(&va->list)))
		return;

	nr_lazy = atomic_long_add_return((va->va_end - va->va_start) >>
				PAGE_SHIFT, &vmap_lazy_nr);
	...

atomic_long_add_return() might also introduce high contention. We could
optimize it by splitting it into several lighter atomics. Can you check
this on your 448-core system?

Thanks!

--
Uladzislau Rezki