Date: Tue, 14 Jul 2020 16:31:20 +1000
From: Nicholas Piggin
Subject: Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
To: Andy Lutomirski
Cc: Anton Blanchard, Arnd Bergmann, linux-arch, LKML, Linux-MM, linuxppc-dev, Andy Lutomirski, Mathieu Desnoyers, Peter Zijlstra, X86 ML
References: <1594658283.qabzoxga67.astroid@bobo.none> <010054C3-7FFF-4FB5-BDA8-D2B80F7B1A5D@amacapital.net> <1594701900.gcgdq8p13l.astroid@bobo.none>
In-Reply-To: <1594701900.gcgdq8p13l.astroid@bobo.none>
Message-Id: <1594708054.04iuyxuyb5.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>>
>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin wrote:
>>>
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin wrote:
>>>>>
>>>>> On big systems, the mm refcount can become highly contended when doing
>>>>> a lot of context switching with threaded applications (particularly
>>>>> switching between the idle thread and an application thread).
>>>>>
>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>> any remaining lazy ones.
>>>>>
>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>> switches per second. After this patch it goes up to 118 million.
>>>>>
>>>>
>>>> I read the patch a couple of times, and I have a suggestion that could
>>>> be nonsense. You are, effectively, using mm_cpumask() as a sort of
>>>> refcount. You're saying "hey, this mm has no more references, but it
>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>> those references too." I'm wondering whether you actually need the
>>>> IPI. What if, instead, you actually treated mm_cpumask as a refcount
>>>> for real? Roughly, in __mmdrop(), you would only free the page tables
>>>> if mm_cpumask() is empty. And, in the code that removes a CPU from
>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>> mm.
>>>>
>>>> Getting the locking right here could be a bit tricky -- you need to
>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>> exiting lazy TLB. Perhaps this could be resolved by having mm_count
>>>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>>
>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>> only ever take the lock after mm_users goes to zero.
>>>
>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>>
>>> I haven't seen much problem here that made me too concerned about IPIs
>>> yet, so I think the simple patch may be good enough to start with
>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>> unlazying with the exit TLB flush without doing anything fancy with
>>> ref counting, but we'll see.
>>
>> I would be cautious with benchmarking here. I would expect that the
>> nasty cases may affect power consumption more than performance -- the
>> specific issue is IPIs hitting idle cores, and the main effects are to
>> slow down exit() a bit but also to kick the idle core out of idle.
>> Although, if the idle core is in a deep sleep, that IPI could be
>> *very* slow.
>
> It will tend to be self-limiting to some degree (deeper idle cores
> would tend to have less chance of IPI) but we have bigger issues on
> powerpc with that, like broadcast IPIs to the mm cpumask for THP
> management. Power hasn't really shown up as an issue but powerpc
> CPUs may have their own requirements and issues there, shall we say.
>
>> So I think it's worth at least giving this a try.
>
> To be clear it's not a complete solution itself. The problem is of
> course that mm cpumask gives you false negatives, so the bits
> won't always clean up after themselves as CPUs switch away from their
> lazy tlb mms.

^^

False positives: CPU is in the mm_cpumask, but is not using the mm
as a lazy tlb. So there can be bits left and never freed.

If you closed the false positives, you're back to a shared mm cache
line on lazy mm context switches.

Thanks,
Nick
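
For readers following the thread, here is a rough userspace sketch of the
"treat mm_cpumask as a refcount" idea quoted above, and of why stale bits
defeat it. It is not kernel code and not anything from the patch series:
struct toy_mm and every toy_mm_* function are hypothetical stand-ins, the
mask is a single 64-bit word, and C11 sequentially consistent atomics stand
in for whatever cmpxchg/dec_return and barriers a real implementation would
need. The count field follows the suggested convention of 1 meaning "the
cpumask may own the mm" and 0 meaning "dead".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct mm_struct; not kernel code. */
struct toy_mm {
    _Atomic uint64_t cpumask; /* one bit per CPU that may hold the mm lazily */
    atomic_int users;         /* stand-in for mm_users */
    atomic_int count;         /* 1: cpumask may own the mm, 0: dead */
    bool freed;               /* set by the one caller that wins the free */
};

static void toy_mm_init(struct toy_mm *mm, uint64_t lazy_cpus, int users)
{
    atomic_init(&mm->cpumask, lazy_cpus);
    atomic_init(&mm->users, users);
    atomic_init(&mm->count, 1);
    mm->freed = false;
}

/* Only the caller that moves count from 1 to 0 is allowed to free. */
static bool toy_mm_try_free(struct toy_mm *mm)
{
    int expected = 1;

    if (!atomic_compare_exchange_strong(&mm->count, &expected, 0))
        return false;
    mm->freed = true; /* a real implementation would free page tables here */
    return true;
}

/* Drop a user reference; free only if no CPU bit is left in the mask. */
static bool toy_mm_put_user(struct toy_mm *mm)
{
    if (atomic_fetch_sub(&mm->users, 1) != 1)
        return false; /* other users remain */
    if (atomic_load(&mm->cpumask) != 0)
        return false; /* some CPU may still hold the mm lazily */
    return toy_mm_try_free(mm);
}

/* A CPU leaves lazy TLB mode: clear its bit, free if it was the last one. */
static bool toy_mm_exit_lazy(struct toy_mm *mm, unsigned int cpu)
{
    assert(cpu < 64);
    uint64_t bit = UINT64_C(1) << cpu;
    uint64_t old = atomic_fetch_and(&mm->cpumask, ~bit);

    if ((old & ~bit) != 0)
        return false; /* other CPUs still have bits set */
    if (atomic_load(&mm->users) != 0)
        return false; /* the mm is still in use */
    return toy_mm_try_free(mm);
}
```

In this sketch, if the last user exits and the last lazy CPU clears its bit
at the same moment, both may reach toy_mm_try_free(), but the compare-exchange
lets only one of them free. The false-positive problem shows up as a bit that
nobody ever clears: toy_mm_put_user() then sees a nonempty mask, returns
false, and the mm is never freed. Closing that gap means every lazy switch
has to write the shared mask, which is the shared cache line the scheme was
trying to avoid.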