From: Mateusz Guzik <mjguzik@gmail.com>
Date: Thu, 27 Jun 2024 18:55:02 +0200
Subject: Re: [linux-next:master] [lockref] d042dae6ad: unixbench.throughput -33.7% regression
To: Linus Torvalds
Cc: kernel test robot, oe-lkp@lists.linux.dev, lkp@intel.com, Linux Memory Management List, Christian Brauner, linux-kernel@vger.kernel.org, ying.huang@intel.com, feng.tang@intel.com, fengwei.yin@intel.com
On Thu, Jun 27, 2024 at 6:32 PM Linus Torvalds wrote:
>
> On Thu, 27 Jun 2024 at 00:00, Mateusz Guzik wrote:
> >
> > I'm arranging access to a 128-way machine to prod, will report back
> > after I can profile on it.
>
> Note that the bigger problem is probably not the 128-way part, but the
> "two socket" part.
>

I know, mate, I'm painfully aware of NUMA realities. The bigger stuff I
have intermittent access to is all multi-socket; the above is basically
"need to use something bigger than my usual trusty small-core setup".

> From a quick look at the profile data, we have for self-cycles:
>
>   shell1 subtest:
>      +4.2 lockref_get_not_dead
>      +4.5 lockref_put_return
>     +16.8 native_queued_spin_lock_slowpath
>
>   getdent subtest:
>      +4.1 lockref_put_return
>      +5.7 lockref_get_not_dead
>     +68.0 native_queued_spin_lock_slowpath
>
> which means that the spin lock got much more expensive, _despite_ the
> "fast path" in lockref being more aggressive.
>
> Which in turn implies that the problem may be at least partly simply
> due to much more cacheline ping-pong. In particular, the lockref
> routines may be "fast", but they hit that one cacheline over and over
> again and have a thundering herd issue, while the queued spinlocks on
> their own actually try to avoid that for multiple CPU's.
>
> IOW, the queue in the queued spinlocks isn't primarily about fairness
> (although that is a thing), it's about not having all CPU cores
> accessing the same spinlock cacheline.
>
> Note that a lot of the other numbers seem "incidental". For example,
> for the getdents subtest we have a lot of the numbers going down by
> ~55%, but while that looks like a big change, it's actually just a
> direct result of this:
>
>     -56.5% stress-ng.getdent.ops
>
> iow, the benchmark fundamentally did 56% less work.
>
> IOW, I think this may all be fundamental, and we just can't do the
> "wait for spinlock" thing, because that whole loop with a cpu_relax()
> is just deadly.
>
> And we've seen those kinds of busy-loops be huge problems before. When
> you have this kind of busy-loop:
>
>         old.lock_count = READ_ONCE(lockref->lock_count);
>         do {
>                 if (lockref_locked(old)) {
>                         cpu_relax();
>                         old.lock_count = READ_ONCE(lockref->lock_count);
>                         continue;
>                 }
>
> the "cpu_relax()" is horrendously expensive, but not having it is not
> really an option either, since it will just cause a tight core-only
> loop.
>
> I suspect removing the cpu_relax() would help performance, but I
> suspect the main help would come from it effectively cutting down the
> wait cycles to practically nothing.
>

As far as lockref goes, I had two ideas to test:

1. cpu_relax() more than once per check, backoff style. Note that the
total spin count would still be bounded before the routine gives up and
takes the lock. This should reduce the cacheline pulling.

2. Check how many spins are actually needed before lockref decides to
fall back to locking.

When messing around with not automatically taking the lock, I measured
spin counts with an artificially high limit. It was something like < 300
to cover the cases I ran into (not this benchmark). For all I know the
limit can be bumped to -- say 256 -- and numerous lock acquires will
disappear at that scale, which should be good enough for everybody(tm).
All while forward progress is guaranteed.
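To put some shape on idea 1: instead of a single cpu_relax() per check,
the wait loop from the snippet you quoted could back off progressively
while still capping the total spin count before falling back to the
lock. Completely untested sketch (lockref_locked() is reused from your
snippet; the 256 cap and the 16 backoff ceiling are numbers I made up):

	int spins = 0, backoff = 1;

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (lockref_locked(old)) {
		int i;

		if (spins >= 256)
			break;	/* cap hit: give up and take the spinlock */
		for (i = 0; i < backoff; i++)
			cpu_relax();	/* stay off the cacheline longer each round */
		spins += backoff;
		if (backoff < 16)
			backoff <<= 1;	/* back off a little harder next time */
		old.lock_count = READ_ONCE(lockref->lock_count);
	}
	/* lock observed free (or cap hit): retry the cmpxchg / lock as before */

The exact constants don't matter much. The point is that the total spin
count stays bounded (so forward progress is still guaranteed), but the
line gets re-read less often while waiting, so it should bounce less.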
That aside, the getdent thing also uses mutexes, and there is something
odd going on there. I verified that workers go to sleep when it is
avoidable, which disfigures the result. I don't know the scale of it;
it may be that it is a tiny fraction of consumers, but I do see them in
offcpu tracing.

That's that for handwaving. I'm going to get the hw and come back with
hard profiling data + results.

--
Mateusz Guzik