Date: Mon, 22 May 2023 19:21:59 +0800
From: Baoquan He <bhe@redhat.com>
To: Thomas Gleixner
Cc: "Russell King (Oracle)", Andrew Morton, linux-mm@kvack.org,
    Christoph Hellwig, Uladzislau Rezki, Lorenzo Stoakes, Peter Zijlstra,
    John Ogness, linux-arm-kernel@lists.infradead.org, Mark Rutland,
    Marc Zyngier, x86@kernel.org, Nadav Amit
Subject: Re: [RFC PATCH 3/3] mm/vmalloc.c: change _vm_unmap_aliases() to do purge firstly
In-Reply-To: <87jzx1xou9.ffs@tglx>
References: <87fs7w8z6y.ffs@tglx> <874joc8x7d.ffs@tglx> <87r0rg73wp.ffs@tglx>
 <87edng6qu8.ffs@tglx> <87cz2w415t.ffs@tglx> <87jzx1xou9.ffs@tglx>

On 05/22/23 at 01:10am, Thomas Gleixner wrote:
> On Sat, May 20 2023 at 07:46, Baoquan He wrote:
> > On 05/19/23 at 08:38pm, Thomas Gleixner wrote:
> >> That block would have hundreds of pages left and is the most recently
> >> allocated. So the next vb_alloc() has to reallocate one instead of using
> >> the one which was allocated just before.
> >>
> >> This clearly lacks a free space check so that blocks which have more
> >> free space than a certain threshold are not thrown away prematurely.
> >> Maybe it wants an age check too, so that blocks which are unused for a
> >> long time can be recycled, but that's an orthogonal issue.
> >
> > You are right, the vmap_block alloc/free does have the issue you pointed
> > out here. What I can defend is that it should be fine as long as
> > VM_FLUSH_RESET_PERMS memory doesn't upset the situation. As we see, the
> > lazy flush will only be triggered when lazy_max_pages() is met, or
> > alloc_vmap_area() can't find an available range. If either of these
> > happens, it means we really need to flush and reclaim the unmapped areas
> > into the free list/tree since the vmalloc address space has run out. Even
> > though a vmap_block has much free space left, it still needs to be purged
> > to cope with the emergency.
> >
> > So, if we pick VM_FLUSH_RESET_PERMS memory out and flush it alone, and
>
> That might be counterproductive depending on the usage scenario,
> especially as that BPF filtering is gaining traction.
>
> > set a threshold for vmap_block purging, is it better?
>
> The point is that there is no differentiation between:
>
>    1) Purging freed areas, i.e. when vmap_lazy_nr hits the threshold,
>       from drain_vmap_area_work()
>
>    2) Reclaiming vmalloc address space from pcpu_get_vm_areas()
>
>    3) _unmap_aliases()
>
> #1 You really don't want to reclaim vmap blocks from the per cpu free
>    lists, unless they have only a few free pages left and no active
>    mappings.
>
>    There is no reason to zap vmap blocks unconditionally, simply because
>    reclaiming all lazy areas drains at least
>
>       32MB * fls(num_online_cpus())
>
>    per invocation, which is plenty, no?
>
> #2 You might want to try #1 first, and if that does not help then reclaim
>    all vmap blocks on the per cpu lists which have no active mappings
>    unconditionally.
>
> #3 _unmap_aliases() requires touching everything because the caller has
>    no clue which vmap_area used a particular page and the vmap_area lost
>    that information too.
>
>    Except for the vfree + VM_FLUSH_RESET_PERMS case, which removes the
>    vmap area first and then cares about the flush. That in turn requires
>    a full walk of _all_ vmap areas including the one which was just
>    added to the purge list.
>
>    I can see the point that this is used to combine outstanding TLB
>    flushes and do the housekeeping of purging freed areas, but like #1
>    there is no real good reason to zap usable vmap blocks
>    unconditionally.
>
> I'm all for consolidating functionality, but that swiss army knife
> approach of one fits it all does not make sense to me.

OK, got it.
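(For reference, that 32MB * fls(num_online_cpus()) comes from lazy_max_pages();
roughly, from the current mm/vmalloc.c:)

	static unsigned long lazy_max_pages(void)
	{
		unsigned int log;

		/* Scale the drain threshold with the number of online CPUs ... */
		log = fls(num_online_cpus());

		/* ... in steps of 32MB worth of pages. */
		return log * (32UL * 1024 * 1024 / PAGE_SIZE);
	}

So one purge pass triggered from drain_vmap_area_work() already covers quite a
lot of address space before any usable vmap blocks would need to be touched.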
> >> The other issue I pointed out:
> >>
> >> Assume the block has (for simplicity) 255 allocations of size 4 pages,
> >> and again free space of 4 pages.
> >>
> >> 254 allocations are freed, which means there is one remaining
> >> mapped. All 254 freed are flushed via __purge_vmap_area_lazy() over
> >> time.
> >>
> >> Now the last allocation is freed and the block is moved to the
> >> purge_vmap_area_list, which then does a full flush of the complete area,
> >> i.e. 4MB in that case, while in fact it only needs to flush 2 pages.

Agree, it sounds problematic. With your cure code, we can check that when we
flush each va on the purge list. If we keep the vmap_block from being freed
until it is checked and freed via the purge list, we can flush only the last
two dirty pages of the range in __purge_vmap_area_lazy(); please see the next
comment about how we could fix it.

> > It's easy to fix. For vmap_block, I have marked it in va->flags with
> > VMAP_RAM|VMAP_BLOCK. When flushing va in the purge list, we can skip
> > vmap_block va.
>
> How can you skip that? The last 2 pages in that VA still need to be
> flushed, no?
>
> But the VA has no information about already done partial flushes via the
> vmap_block, so this requires flushing the full VA range unconditionally.

Right, I see the problem you spotted with the illustrations. For the last 2
pages in that VA, we can keep the vmap_block until we iterate the purge list
in __purge_vmap_area_lazy(). Since you will iterate the purge list to add each
va range to an array, we can do something like the draft code below to fix it;
the check is meant to run for each va while walking local_purge_list, and s/e
stand for the dirty sub-range that would be fed into the flush. Surely this is
on top of your cure code.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5ca55b357148..4b11a32df49d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1728,6 +1728,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	unsigned int num_purged_areas = 0;
 	struct list_head local_purge_list;
 	struct vmap_area *va, *n_va;
+	struct vmap_block *vb;
 
 	lockdep_assert_held(&vmap_purge_lock);
 
@@ -1736,6 +1737,15 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	list_replace_init(&purge_vmap_area_list, &local_purge_list);
 	spin_unlock(&purge_vmap_area_lock);
 
+	/* Only flush the still-dirty sub-range of a vmap_block va. */
+	vb = container_of(va, struct vmap_block, va);
+	if ((va->flags & VMAP_FLAGS_MASK) == (VMAP_RAM|VMAP_BLOCK) &&
+	    vb->dirty_max) {
+		s = vb->dirty_min << PAGE_SHIFT;
+		e = vb->dirty_max << PAGE_SHIFT;
+		kfree(vb);
+	}
+
 	if (unlikely(list_empty(&local_purge_list)))
 		goto out;
 
@@ -2083,7 +2092,6 @@ static void free_vmap_block(struct vmap_block *vb)
 	spin_unlock(&vmap_area_lock);
 
 	free_vmap_area_noflush(vb->va);
-	kfree_rcu(vb, rcu_head);
 }
 
 static void purge_fragmented_blocks(int cpu)
@@ -2163,11 +2171,6 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
 	vaddr = vmap_block_vaddr(vb->va->va_start, pages_off);
 	vb->free -= 1UL << order;
 	bitmap_set(vb->used_map, pages_off, (1UL << order));
-	if (vb->free == 0) {
-		spin_lock(&vbq->lock);
-		list_del_rcu(&vb->free_list);
-		spin_unlock(&vbq->lock);
-	}
 	spin_unlock(&vb->lock);
 
 	break;

> That made me think about the following scenario:
>
>    vm_map_ram()
>      vb_alloc()
>        // Uses page P
>        vb->free -= 1UL << order;
>        if (vb->free == 0)
>            list_del_rcu(&vb->free_list);
>
>    vm_unmap_ram()
>      vb_free()
>        Does not free the last mapping of the vmap_block
>        Clears the page table entry for page P, but does not flush TLB
>
>    __free_page(P)
>
>    page P gets reused
>
>    vm_unmap_aliases()
>
>      Does not find the VA which used page P because neither the VB is in
>      free_list nor the VA is in the purge_list
>
>      Unless _vm_unmap_aliases() finds a large enough range to cover
>      page P or ends up with a flush_tlb_all() the stale TLB(s) persists.
>
> No?
>
> Same problem in this scenario:
>
>    addr = vm_map_ram()
>      vb_alloc()
>        vb->free -= 1UL << order;
>        if (vb->free == 0)
>            list_del_rcu(&vb->free_list);
>
>    set_memory_*(addr)
>      vm_unmap_aliases()
>
>        Does not find the VA which contains @addr because neither the VB is
>        in free_list nor the VA is in the purge_list
>
>        Unless _vm_unmap_aliases() finds a large enough range to cover
>        @addr or ends up with a flush_tlb_all() the stale TLB(s) persists.
>
> No?
>
> > I don't know how you will tackle the per va flush Nadav pointed out,
> > so I will not give a draft code.
>
> It's not that hard.
>
> An array of ranges which can be memcpy()'ed into that existing x86
> percpu flush info thing, which avoids the cache line issue Nadav is
> concerned of.
>
> That array would have obviously a limited number of entries as anything
> with multiple ranges is most likely going to end up in a full flush
> anyway. The drain_vmap_area_work() case definitely does so, as
> vmap_lazy_nr is guaranteed to be at least covering 32MB/PAGE_SIZE worth
> of pages.
>
> I just got distracted and did not come around to finish it for various
> reasons. One of them is to make sense of this code :)

OK, looking forward to seeing it done when it's convenient for you.
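Just to check whether I understand the array idea, below is the rough shape I
have in mind. It is only a sketch, the names (flush_range, NR_FLUSH_RANGES)
are made up by me, and it ignores the memcpy() into the x86 percpu flush info
you mentioned:

	/* Hypothetical sketch only, not against any real tree. */
	#define NR_FLUSH_RANGES	8	/* made-up limit */

	struct flush_range {
		unsigned long start;
		unsigned long end;
	};

	static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
	{
		struct flush_range ranges[NR_FLUSH_RANGES];
		unsigned int nr = 0, i;
		bool full_flush = false;
		struct vmap_area *va;

		...
		list_for_each_entry(va, &local_purge_list, list) {
			unsigned long s = va->va_start;
			unsigned long e = va->va_end;

			/* For a vmap_block va, shrink s/e to its dirty sub-range. */

			if (nr == NR_FLUSH_RANGES) {
				/* Too many ranges, give up and flush everything. */
				full_flush = true;
				break;
			}
			ranges[nr].start = s;
			ranges[nr].end = e;
			nr++;
		}

		if (full_flush)
			flush_tlb_kernel_range(start, end);
		else
			for (i = 0; i < nr; i++)
				flush_tlb_kernel_range(ranges[i].start,
						       ranges[i].end);
		...
	}

If the ranges do not fit, falling back to one flush over the whole [start, end]
span keeps the array small, as you said.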
> >> Also these intermediate flushes are inconsistent versus how fully
> >> utilized blocks are handled:
> >>
> >>    vb_alloc()
> >>        if (vb->free == 0)
> >>            list_del(vb->free_list);
> >>
> >> So all allocations which are freed after that point stay unflushed until
> >> the last allocation is freed which moves the block to the
> >> purge_vmap_area_list, where it gets a full VA range flush.
> >
> > That may be risky if they stay unflushed until the last allocation is
> > freed. We use the vm_map_ram() interface to map passed-in pages into the
> > vmalloc area. If vb_free() is called, the sub-region has been unmapped
> > and the user may have released the pages already. A user of
> > vm_unmap_aliases() may be impacted if we don't flush those areas freed
> > with vb_free(). In reality, those areas have been unmapped, while stale
> > TLB entries still exist. Not very sure about that.
> >
> > If we can hold off the vmap_block flush until purging it w/o risk, it
> > will save us many troubles.
>
> The point is that the TLBs _must_ be flushed
>
>    1) Before the virtual address space occupied by a vmap_area is
>       released and can be reused.
>
>       So one might argue that flushing TLBs early is good to detect use
>       after free.
>
>       The DEBUG_PAGEALLOC case flushes right after
>       vunmap_range_noflush(), which is the proper thing to do for
>       debugging as any UAF will be immediately detected after the
>       function returns.
>
>       The production case flushes lazily, and fully utilized blocks are
>       not on the free list and therefore not flushed until they are
>       purged.
>
>       The partial flush via _vm_unmap_aliases() solves a completely
>       different problem. See #2/#3 below.
>
>    2) Before a page, which might have been reused before the lazy flush
>       happened, can be used again. Also when the mapping attributes of a
>       page are changed via set_memory_*()
>
>       That's what vm_unmap_aliases() -> _vm_unmap_aliases() is for.
>
>    3) When an executable mapping is torn down
>
>       That's the vfree() + VM_FLUSH_RESET_PERMS case which also ends up
>       in _vm_unmap_aliases()

I see it now, thanks.

> I completely understand that there needs to be a balance between the
> high utilization use case and the occasional one which made us look into
> this. That's fine, but there needs to be a way to make it the least
> overhead for both.

Agree. Based on all the discussions, it would be much better if the three
things below are done. I haven't considered whether adding a dirty bitmap to
vmap_block is necessary and acceptable.

1) A threshold for vmap_block purging, e.g. 3/4 or 2/3 of VMAP_BBMAP_BITS
   (a rough sketch of what I mean is at the end of this mail);
2) your cure patch + the draft code I pasted above;
3) per-va ranges added to an array and passed to flush_tlb_kernel_xx().

> /me tries to find some spare cycles ...
>
> Thanks,
>
>         tglx
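P.S. For 1), this is roughly what I mean, relative to the existing check in
purge_fragmented_blocks(). Untested, and the threshold name and value are just
placeholders:

	/* Made-up threshold: keep blocks which still have >= 1/4 free space. */
	#define VMAP_PURGE_THRESHOLD	(3 * VMAP_BBMAP_BITS / 4)

	list_for_each_entry_rcu(vb, &vbq->free, free_list) {
		/* Skip blocks which still have active mappings. */
		if (vb->free + vb->dirty != VMAP_BBMAP_BITS)
			continue;
		/* Skip blocks which still have plenty of free space. */
		if (vb->dirty < VMAP_PURGE_THRESHOLD)
			continue;

		/* ... existing purging of the block ... */
	}

That way a recently allocated block with most of its space still free stays on
the per-CPU free list instead of being thrown away, which is the premature
recycling you pointed out at the top of this thread.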