Date: Fri, 19 May 2023 19:49:28 +0800
From: Baoquan He <bhe@redhat.com>
To: Thomas Gleixner
Cc: "Russell King (Oracle)", Andrew Morton, linux-mm@kvack.org,
	Christoph Hellwig, Uladzislau Rezki, Lorenzo Stoakes,
	Peter Zijlstra, John Ogness, linux-arm-kernel@lists.infradead.org,
	Mark Rutland, Marc Zyngier, x86@kernel.org
Subject: Re: Excessive TLB flush ranges
In-Reply-To: <875y8o5zwm.ffs@tglx>
References: <87ilcs8zab.ffs@tglx> <87fs7w8z6y.ffs@tglx> <874joc8x7d.ffs@tglx>
	<87r0rg73wp.ffs@tglx> <87edng6qu8.ffs@tglx> <87y1ln5md2.ffs@tglx>
	<875y8o5zwm.ffs@tglx>
On 05/19/23 at 01:22pm, Thomas Gleixner wrote:
> On Wed, May 17 2023 at 18:52, Baoquan He wrote:
> > On 05/17/23 at 11:38am, Thomas Gleixner wrote:
> >> On Tue, May 16 2023 at 21:03, Thomas Gleixner wrote:
> >> >
> >> > Aside of that, if I read the code correctly then if there is an unmap
> >> > via vb_free() which does not cover the whole vmap block then vb->dirty
> >> > is set and every _vm_unmap_aliases() invocation flushes that dirty
> >> > range over and over until that vmap block is completely freed, no?
> >>
> >> Something like the below would cure that.
> >>
> >> While it prevents that this is flushed forever it does not cure the
> >> eventually overly broad flush when the block is completely dirty and
> >> purged:
> >>
> >> Assume a block with 1024 pages, where 1022 pages are already freed and
> >> TLB flushed. Now the last 2 pages are freed and the block is purged,
> >> which results in a flush of 1024 pages where 1022 are already done,
> >> right?
> >
> > This is a good idea, I was thinking how to reply to your last mail and
> > how to fix this. However, your cure code may not work well. Please see
> > the inline comment below.
>
> See below.
>
> > One vmap block has 64 pages.
> > #define VMAP_MAX_ALLOC	BITS_PER_LONG	/* 256K with 4K pages */
>
> No, VMAP_MAX_ALLOC is the allocation limit for a single vb_alloc().
>
> On 64bit it has at least 128 pages, but can have up to 1024:
>
> #define VMAP_BBMAP_BITS_MAX	1024	/* 4MB with 4K pages */
> #define VMAP_BBMAP_BITS_MIN	(VMAP_MAX_ALLOC*2)
>
> and then some magic happens to calculate the actual size
>
> #define VMAP_BBMAP_BITS \
>		VMAP_MIN(VMAP_BBMAP_BITS_MAX, \
>		VMAP_MAX(VMAP_BBMAP_BITS_MIN, \
>		VMALLOC_PAGES / roundup_pow_of_two(NR_CPUS) / 16))
>
> which is in a range of (2*BITS_PER_LONG) ... 1024.
>
> The actual vmap block size is:
>
> #define VMAP_BLOCK_SIZE	(VMAP_BBMAP_BITS * PAGE_SIZE)

You are right, it's 1024. I was dizzy at that time.

> Which is then obviously something between 512k and 4MB on 64bit and
> between 256k and 4MB on 32bit.
>
> >> @@ -2240,13 +2240,17 @@ static void _vm_unmap_aliases(unsigned l
> >> 	rcu_read_lock();
> >> 	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
> >> 		spin_lock(&vb->lock);
> >> -		if (vb->dirty && vb->dirty != VMAP_BBMAP_BITS) {
> >> +		if (vb->dirty_max && vb->dirty != VMAP_BBMAP_BITS) {
> >> 			unsigned long va_start = vb->va->va_start;
> >> 			unsigned long s, e;
> >
> > When vb_free() is invoked, it could cause three kinds of vmap_block as
> > below. Your code works well for the 2nd case, but may not for the 1st
> > one. And the 2nd one is the stuff that we reclaim and put into the
> > purge list in purge_fragmented_blocks_allcpus().
> >
> > 1)
> >	|-----|------------|-----------|-------|
> >	|dirty|still mapped|   dirty   | free  |
> >
> > 2)
> >	|------------------------------|-------|
> >	|            dirty             | free  |
>
> You sure? The first one is put into the purge list too.

No way. You didn't copy the essential code here. The key line is the
calculation of vb->dirty. ->dirty_min and ->dirty_max only provide a
loose value for calculating the flush range. Counting one page more or
less into ->dirty_min or ->dirty_max won't matter much, it just makes
the flush do some meaningless work. Counting ->dirty wrong, however,
will cause serious problems.
If you put case 1 into the purge list, freeing it later will fail
because you can't find it in the vmap_area_root tree. Please check
vfree() and remove_vm_area().

	/* Expand dirty range */
	vb->dirty_min = min(vb->dirty_min, offset);
	vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
	vb->dirty += 1UL << order;

Please note the judgement of the 2nd case as below; it means there's
only free and dirty, and dirty doesn't reach VMAP_BBMAP_BITS, so it's
case 2:

	(vb->free + vb->dirty == VMAP_BBMAP_BITS && vb->dirty != VMAP_BBMAP_BITS)

By the way, I made an RFC patchset based on your patch and on your
earlier mail in which you raised some questions. I will add it here,
please help check if it's worth posting for discussion and review.

>         /* Expand dirty range */
>         vb->dirty_min = min(vb->dirty_min, offset);
>         vb->dirty_max = max(vb->dirty_max, offset + (1UL << order));
>
>                   pages   bits    dirtymin         dirtymax
>  vb_alloc(A)        2     0 - 1   VMAP_BBMAP_BITS  0
>  vb_alloc(B)        4     2 - 5
>  vb_alloc(C)        2     6 - 7
>
> So you get three variants:
>
> 1) Flush after freeing A
>
>  vb_free(A)         2     0 - 1   0                1
>  Flush                            VMAP_BBMAP_BITS  0  <- correct
>  vb_free(C)         2     6 - 7   6                7
>  Flush                            VMAP_BBMAP_BITS  0  <- correct
>
> 2) No flush between freeing A and C
>
>  vb_free(A)         2     0 - 1   0                1
>  vb_free(C)         2     6 - 7   0                7
>  Flush                            VMAP_BBMAP_BITS  0  <- overbroad flush
>
> 3) No flush between freeing A, C, B
>
>  vb_free(A)         2     0 - 1   0                1
>  vb_free(C)         2     6 - 7   0                7
>  vb_free(B)         4     2 - 5   0                7
>  Flush                            VMAP_BBMAP_BITS  0  <- correct
>
> So my quick hack makes it correct for #1 and #3 and prevents repeated
> flushes of already flushed areas.
>
> To prevent #2 you need a bitmap which keeps track of the flushed areas.

I made a draft patchset based on your earlier mail,

> Thanks,
>
>         tglx