Subject: Re: [PATCH v2 00/11] Introduces new count-based method for monitoring lockless pagetable wakls
To: Leonardo Bras
CC: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
    Arnd Bergmann, Aneesh Kumar K.V, Christophe Leroy, Andrew Morton,
    Dan Williams, Nicholas Piggin, Mahesh Salgaonkar, Thomas Gleixner,
    Richard Fontana, Ganesh Goudar, Allison Randal, Greg Kroah-Hartman,
    Mike Rapoport, YueHaibing, Ira Weiny, Jason Gunthorpe, Keith Busch,
    Linux-MM
References: <20190920195047.7703-1-leonardo@linux.ibm.com>
    <1f5d9380418ad8bb90c6bbdac34716c650b917a0.camel@linux.ibm.com>
From: John Hubbard
Message-ID: <95a6e165-cc71-e584-8d17-df05c4a95aaa@nvidia.com>
Date: Fri, 20 Sep 2019 14:24:55 -0700
In-Reply-To: <1f5d9380418ad8bb90c6bbdac34716c650b917a0.camel@linux.ibm.com>

On 9/20/19 1:12 PM, Leonardo Bras wrote:
> If a process (qemu) with a lot of CPUs (128) tries to munmap() a large
> chunk of memory (496GB) mapped with THP, it takes an average of 275
> seconds, which can cause a lot of problems for the load (in qemu's
> case, the guest will lock up for this time).
>
> Trying to find the source of this bug, I found out most of this time is
> spent in serialize_against_pte_lookup(). This function will take a lot
> of time in smp_call_function_many() if there are more than a couple of
> CPUs running the user process. Since it has to happen for every mapped
> THP, it will take a very long time for large amounts of memory.
>
> By the docs, serialize_against_pte_lookup() is needed in order to
> prevent the pmd_t to pte_t casting inside find_current_mm_pte(), or any
> lockless pagetable walk, from happening concurrently with THP
> splitting/collapsing.
>
> It does so by calling do_nothing() on each CPU in mm->cpu_bitmap[],
> after interrupts are re-enabled.
> Since interrupts are (usually) disabled during a lockless pagetable
> walk, and serialize_against_pte_lookup() will only return after
> interrupts are enabled, it is protected.
>
> So, from what I could understand, if there is no lockless pagetable
> walk running, there is no need to call serialize_against_pte_lookup().
>
> So, to avoid the cost of running serialize_against_pte_lookup(), I
> propose a counter that keeps track of how many find_current_mm_pte()
> calls are currently running, and if there are none, just skip
> smp_call_function_many().

Just noticed that this really should also include linux-mm. Maybe it's
best to repost the patchset with them included?

In particular, there is likely to be some feedback about adding more
calls, in addition to local_irq_disable/enable, around the gup_fast()
path, separately from my questions about the synchronization cases in
ppc.

thanks,
--
John Hubbard
NVIDIA

>
> The related functions are:
> start_lockless_pgtbl_walk(mm)
>         Insert before starting any lockless pgtable walk
> end_lockless_pgtbl_walk(mm)
>         Insert after the end of any lockless pgtable walk
>         (Mostly after the ptep is last used)
> running_lockless_pgtbl_walk(mm)
>         Returns the number of lockless pgtable walks running
>
>
> On my workload (qemu), I could see munmap's time drop from 275
> seconds to 418ms.
>
>> Leonardo Bras (11):
>>   powerpc/mm: Adds counting method to monitor lockless pgtable walks
>>   asm-generic/pgtable: Adds dummy functions to monitor lockless pgtable
>>     walks
>>   mm/gup: Applies counting method to monitor gup_pgd_range
>>   powerpc/mce_power: Applies counting method to monitor lockless pgtbl
>>     walks
>>   powerpc/perf: Applies counting method to monitor lockless pgtbl walks
>>   powerpc/mm/book3s64/hash: Applies counting method to monitor lockless
>>     pgtbl walks
>>   powerpc/kvm/e500: Applies counting method to monitor lockless pgtbl
>>     walks
>>   powerpc/kvm/book3s_hv: Applies counting method to monitor lockless
>>     pgtbl walks
>>   powerpc/kvm/book3s_64: Applies counting method to monitor lockless
>>     pgtbl walks
>>   powerpc/book3s_64: Enables counting method to monitor lockless pgtbl
>>     walk
>>   powerpc/mm/book3s64/pgtable: Uses counting method to skip serializing
>>
>>  arch/powerpc/include/asm/book3s/64/mmu.h     |  3 +++
>>  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +++++
>>  arch/powerpc/kernel/mce_power.c              | 13 ++++++++++---
>>  arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 ++
>>  arch/powerpc/kvm/book3s_64_mmu_radix.c       | 20 ++++++++++++++++++--
>>  arch/powerpc/kvm/book3s_64_vio_hv.c          |  4 ++++
>>  arch/powerpc/kvm/book3s_hv_nested.c          |  8 ++++++++
>>  arch/powerpc/kvm/book3s_hv_rm_mmu.c          |  9 ++++++++-
>>  arch/powerpc/kvm/e500_mmu_host.c             |  4 ++++
>>  arch/powerpc/mm/book3s64/hash_tlb.c          |  2 ++
>>  arch/powerpc/mm/book3s64/hash_utils.c        |  7 +++++++
>>  arch/powerpc/mm/book3s64/mmu_context.c       |  1 +
>>  arch/powerpc/mm/book3s64/pgtable.c           | 20 +++++++++++++++++++-
>>  arch/powerpc/perf/callchain.c                |  5 ++++-
>>  include/asm-generic/pgtable.h                |  9 +++++++++
>>  mm/gup.c                                     |  4 ++++
>>  16 files changed, 108 insertions(+), 8 deletions(-)
>>
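
For readers following along, the counting scheme described in the quoted
cover letter can be sketched roughly as below. This is a userspace
illustration using C11 atomics, not the code from the series: struct
mm_sketch, its ipi_broadcasts field, and
serialize_against_pte_lookup_sketch() are hypothetical names made up for
the example, and the broadcast that the real code does through
smp_call_function_many() is only counted here. Only the three
*_lockless_pgtbl_walk() names come from the cover letter, and their
bodies are guesses at the idea. The sketch also makes no attempt to show
that the ordering between the counter check and the page table update is
safe, which is where the synchronization questions come in.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_struct. */
struct mm_sketch {
	atomic_int lockless_pgtbl_walk_count;	/* walks currently in flight */
	int ipi_broadcasts;			/* times the slow path was taken */
};

/* Insert before starting any lockless pgtable walk. */
static void start_lockless_pgtbl_walk(struct mm_sketch *mm)
{
	atomic_fetch_add(&mm->lockless_pgtbl_walk_count, 1);
}

/* Insert after the end of the walk, once the ptep is no longer used. */
static void end_lockless_pgtbl_walk(struct mm_sketch *mm)
{
	atomic_fetch_sub(&mm->lockless_pgtbl_walk_count, 1);
}

/* Returns the number of lockless pgtable walks running on this mm. */
static int running_lockless_pgtbl_walk(struct mm_sketch *mm)
{
	return atomic_load(&mm->lockless_pgtbl_walk_count);
}

/*
 * Assumed shape of the fast path: skip the cross-CPU broadcast when no
 * walk is in flight. Returns true if the broadcast would have been sent.
 */
static bool serialize_against_pte_lookup_sketch(struct mm_sketch *mm)
{
	if (running_lockless_pgtbl_walk(mm) == 0)
		return false;	/* nothing to wait for, skip the broadcast */

	mm->ipi_broadcasts++;	/* placeholder for smp_call_function_many() */
	return true;
}
```

A caller on the walker side would bracket the walk with
start_lockless_pgtbl_walk() and end_lockless_pgtbl_walk(), while the THP
split/collapse side would call the serialize function and pay for the
broadcast only when the counter is nonzero.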