Date: Thu, 23 Feb 2023 09:12:19 +0000
From: Daniel P. Berrangé
To: Jason Gunthorpe
Cc: Alistair Popple, Tejun Heo, Michal Hocko, Yosry Ahmed,
    linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    jhubbard@nvidia.com, tjmercier@google.com, hannes@cmpxchg.org,
    surenb@google.com, mkoutny@suse.com, daniel@ffwll.ch,
    Alex Williamson, Zefan Li, Andrew Morton
Subject: Re: [PATCH 14/19] mm: Introduce a cgroup for pinned memory
References: <87o7pmnd0p.fsf@nvidia.com> <87k009nvnr.fsf@nvidia.com>

On Wed, Feb 22, 2023 at 09:53:56PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 23, 2023 at 09:59:35AM +1100, Alistair Popple wrote:
> >
> > Jason Gunthorpe writes:
> >
> > > On Wed, Feb 22, 2023 at 10:38:25PM +1100, Alistair Popple wrote:
> > >> When a driver unpins a page we scan the pinners list and assign
> > >> ownership to the next driver pinning the page by updating
> > >> memcg_data and removing the vm_account from the list.
> > >
> > > I don't see how this works with just the data structure you
> > > outlined?? Every unique page needs its own list_head in the
> > > vm_account; it is doable, just incredibly costly.
> >
> > The idea was that every driver already needs to allocate a pages
> > array to pass to pin_user_pages(), and by necessity drivers have to
> > keep a reference to the contents of that in one form or another. So,
> > conceptually, the equivalent of:
> >
> >   struct vm_account {
> >           struct list_head possible_pinners;
> >           struct mem_cgroup *memcg;
> >           struct page **pages;
> >           [...]
> >   };
> >
> > Unpinning involves finding a new owner by traversing the list of
> > page->memcg_data->possible_pinners and iterating over *pages[] to
> > figure out whether that vm_account actually has this page pinned and
> > could own it.
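(For illustration only, a rough sketch of the kind of hand-off walk
being described above. The struct layout follows the quote, but the
helper name, the npages field and the list head passed in are
assumptions made for the sketch, not anything from the actual series.)

  #include <linux/list.h>
  #include <linux/memcontrol.h>
  #include <linux/mm_types.h>

  struct vm_account {
          struct list_head possible_pinners; /* other accounts that may pin the page */
          struct mem_cgroup *memcg;          /* memcg currently charged for the pins */
          struct page **pages;               /* array handed to pin_user_pages() */
          unsigned long npages;              /* assumed: length of @pages */
  };

  /*
   * On unpin, walk the candidate accounts and hand ownership of @page
   * (i.e. the memcg charge) to the first one that still pins it. The
   * nested scan over pages[] is what makes this costly.
   */
  static struct vm_account *
  vm_account_find_new_owner(struct page *page, struct list_head *pinners)
  {
          struct vm_account *cand;
          unsigned long i;

          list_for_each_entry(cand, pinners, possible_pinners) {
                  for (i = 0; i < cand->npages; i++) {
                          if (cand->pages[i] == page)
                                  return cand; /* charge moves to cand->memcg */
                  }
          }
          return NULL; /* no remaining pinner: uncharge the page entirely */
  }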
>
> Oh, you are focusing on Tejun's DOS scenario.
>
> The DOS problem is to prevent pin users in cgroup A from keeping
> memory charged to cgroup B that it isn't using any more.
>
> cgroup B doesn't need to be pinning the memory; it could just be
> normal VMAs, and "isn't using any more" means it has unmapped all
> the VMAs.
>
> Solving that problem means figuring out when every cgroup stops
> using the memory - pinning or not. That seems to be very costly.
>
> AFAIK this problem also already exists today, as the memcg of a page
> doesn't change while it is pinned. So maybe we don't need to address
> it.
>
> Arguably the pins are not the problem. If we want to treat the pin
> like an allocation then we simply charge the non-owning memcgs for
> the pin as though it were an allocation. E.g. go over every page
> and, if the owning memcg is not the current memcg, charge the
> current memcg for an allocation of the MAP_SHARED memory. Undoing
> this is trivial enough.
>
> This doesn't fix the DOS problem, but it does sort of harmonize the
> pin accounting with the memcg by multi-accounting every pin of a
> MAP_SHARED page.
>
> The other drawback is that this isn't the same thing as the current
> rlimit. The rlimit is largely restricting the creation of unmovable
> memory.
>
> Though, AFAICT memcg seems to bundle unmovable memory (e.g.
> GFP_KERNEL) along with movable user pages, so it would be
> self-consistent.
>
> I'm unclear if this is OK for libvirt..

I'm not sure what exact scenario you're thinking of when talking
about two distinct cgroups and its impact on libvirt. None the less,
here's a rough summary of libvirt's approach to cgroups and memory.

On the libvirt side, we create a single cgroup per VM, in which lives
at least the QEMU process, but possibly also some additional per-VM
helper processes (swtpm for TPM, sometimes slirp/passt for NICs, etc).

Potentially there are also externally managed processes handling some
resources on behalf of the VM. These might be a single centralized
daemon doing work for many VMs, or might be per-VM services. Either
way, since they are externally managed, their setup and usage of
cgroups is completely opaque to libvirt. Libvirt is only concerned
with the one cgroup per VM that it creates and manages. Its goal is
to protect the host OS from a misbehaving guest OS or a compromised
QEMU.

The memory limits we can set on VMs are somewhat limited. In general
we prefer to avoid setting any hard per-VM memory cap by default.
QEMU's worst-case memory usage is incredibly hard to predict, because
of the incredibly broad range of possible configurations and the
opaque behaviour/usage of the ELF libraries it uses. Every time
anyone has tried hard memory caps, we've ended up with VMs being
incorrectly killed because they genuinely wanted more memory than the
algorithm anticipated.

To protect the host OS, I instead tend to suggest that mgmt
apps/admins set a hard memory limit across all VMs in aggregate, e.g.
at /machine.slice, rather than per-VM. This makes it possible to
ensure that the host OS always has some memory reserved for its own
system services, while allowing the individual VMs to battle it out
between themselves.
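(Concretely, and purely as an illustration: on a cgroups v2 host with
systemd, that aggregate cap is just the memory.max of machine.slice.
The path, the 60 GiB figure and the snippet below are example
assumptions, not a recommendation; an admin would more usually run
"systemctl set-property machine.slice MemoryMax=60G" so that systemd
also persists the limit.)

  #include <stdio.h>

  int main(void)
  {
          /* cgroup v2 assumed mounted at /sys/fs/cgroup */
          const char *path = "/sys/fs/cgroup/machine.slice/memory.max";
          unsigned long long limit = 60ULL << 30; /* 60 GiB, in bytes */
          FILE *f = fopen(path, "w");

          if (!f) {
                  perror(path);
                  return 1;
          }
          fprintf(f, "%llu\n", limit);
          return fclose(f) == 0 ? 0 : 1;
  }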
We do still have to apply some tuning for VFIO, around what amount of
memory it is allowed to lock, but that is not so bad: we just need to
allow it to lock guest RAM, which is a known size plus a finite extra
amount, so we don't need to take account of all of QEMU's memory
allocations in general.

This is all still just in the context of one cgroup though, at least
as far as libvirt is aware. Any other cgroups involved are opaque to
libvirt, and not our concern, as long as QEMU's cgroup is preventing
QEMU's misbehaviour as configured.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|