From: Andy Lutomirski
Subject: Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
Date: Tue, 14 Jul 2020 05:46:05 -0700
Message-Id: <6D3D1346-DB1E-43EB-812A-184918CCC16A@amacapital.net>
References: <1594708054.04iuyxuyb5.astroid@bobo.none>
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML
In-Reply-To: <1594708054.04iuyxuyb5.astroid@bobo.none>
To: Nicholas Piggin

> On Jul 13, 2020, at 11:31 PM, Nicholas Piggin wrote:
>
> Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
>> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>>
>>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin wrote:
>>>>
>>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin wrote:
>>>>>>
>>>>>> On big systems, the mm refcount can become highly contended when doing
>>>>>> a lot of context switching with threaded applications (particularly
>>>>>> switching between the idle thread and an application thread).
>>>>>>
>>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>>> any remaining lazy ones.
>>>>>>
>>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>>> switches per second. After this patch it goes up to 118 million.
>>>>>>
>>>>>
>>>>> I read the patch a couple of times, and I have a suggestion that could
>>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of
>>>>> refcount. You're saying "hey, this mm has no more references, but it
>>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>>> those references too." I'm wondering whether you actually need the
>>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount
>>>>> for real? Roughly, in __mmdrop(), you would only free the page tables
>>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from
>>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>>> mm.
>>>>>
>>>>> Getting the locking right here could be a bit tricky -- you need to
>>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count
>>>>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>>>
>>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>>> only ever take the lock after mm_users goes to zero.
>>>>
>>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>>>
>>>> I haven't seen much problem here that made me too concerned about IPIs
>>>> yet, so I think the simple patch may be good enough to start with
>>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>>> unlazying with the exit TLB flush without doing anything fancy with
>>>> ref counting, but we'll see.
>>>
>>> I would be cautious with benchmarking here. I would expect that the
>>> nasty cases may affect power consumption more than performance -- the
>>> specific issue is IPIs hitting idle cores, and the main effects are to
>>> slow down exit() a bit but also to kick the idle core out of idle.
>>> Although, if the idle core is in a deep sleep, that IPI could be
>>> *very* slow.
>>
>> It will tend to be self-limiting to some degree (deeper idle cores
>> would tend to have less chance of IPI) but we have bigger issues on
>> powerpc with that, like broadcast IPIs to the mm cpumask for THP
>> management. Power hasn't really shown up as an issue but powerpc
>> CPUs may have their own requirements and issues there, shall we say.
>>
>>> So I think it's worth at least giving this a try.
>>
>> To be clear it's not a complete solution itself. The problem is of
>> course that mm cpumask gives you false negatives, so the bits
>> won't always clean up after themselves as CPUs switch away from their
>> lazy tlb mms.
>
> ^^
>
> False positives: CPU is in the mm_cpumask, but is not using the mm
> as a lazy tlb. So there can be bits left and never freed.
>
> If you closed the false positives, you're back to a shared mm cache
> line on lazy mm context switches.

x86 has this exact problem.
At least no more than 64*8 CPUs share the cache line :)

Can you share your benchmark?

>
> Thanks,
> Nick
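
For reference, the mm_cpumask-as-refcount idea quoted above can be modeled in
userspace roughly as follows. This is a minimal sketch with C11 atomics, not
kernel code and not taken from the patch: struct toy_mm and all toy_* helpers
are hypothetical names, a single 64-bit word stands in for mm_cpumask(), and
it assumes no CPU sets a bit once users has reached zero. It only shows the
"count == 1 means the cpumask owns the mm, cmpxchg to 0 picks one freer" part
and says nothing about the false-positive bits discussed above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct mm_struct; every name here is hypothetical. */
struct toy_mm {
	atomic_uint_fast64_t cpumask; /* bit n set: CPU n may still use the mm lazily */
	atomic_int users;             /* plays the role of mm_users */
	atomic_int count;             /* 1: alive, owned by the cpumask; 0: dead */
	int freed;                    /* times the free path ran, for checking */
};

static void toy_mm_init(struct toy_mm *mm, int users)
{
	atomic_init(&mm->cpumask, 0);
	atomic_init(&mm->users, users);
	atomic_init(&mm->count, 1);
	mm->freed = 0;
}

/* Free only if no users and no lazy CPUs remain; cmpxchg 1 -> 0 picks one winner. */
static bool toy_try_free(struct toy_mm *mm)
{
	int alive = 1;

	if (atomic_load(&mm->users) != 0 || atomic_load(&mm->cpumask) != 0)
		return false;
	if (!atomic_compare_exchange_strong(&mm->count, &alive, 0))
		return false; /* another CPU already won the race */
	mm->freed++; /* real code would free the page tables here */
	return true;
}

/* A CPU starts holding the mm lazily: set its bit. */
static void toy_enter_lazy(struct toy_mm *mm, unsigned int cpu)
{
	assert(cpu < 64);
	atomic_fetch_or(&mm->cpumask, UINT64_C(1) << cpu);
}

/* A CPU leaves lazy TLB mode: clear its bit and, if it was the last, try to free. */
static bool toy_exit_lazy(struct toy_mm *mm, unsigned int cpu)
{
	uint_fast64_t bit, old;

	assert(cpu < 64);
	bit = UINT64_C(1) << cpu;
	old = atomic_fetch_and(&mm->cpumask, ~bit);
	if (!(old & bit) || (old & ~bit) != 0)
		return false; /* bit was not set, or other CPUs remain */
	return toy_try_free(mm);
}

/* Drop a user reference; the last user tries to free. */
static bool toy_mmput(struct toy_mm *mm)
{
	if (atomic_fetch_sub(&mm->users, 1) != 1)
		return false;
	return toy_try_free(mm);
}
```

With one user and CPUs 3 and 7 lazy, toy_mmput() returns false because bits
remain, toy_exit_lazy() for CPU 3 returns false, and toy_exit_lazy() for CPU 7
returns true and is the only call that frees. The real thing would also have
to handle a CPU going lazy concurrently with the last mmput(), which this
model sidesteps by assumption.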