From: Pasha Tatashin
Date: Wed, 22 Feb 2023 10:43:57 -0500
Subject: Re: [LSF/MM/BPF TOPIC] Virtual Machine Memory Passthrough
To: Zi Yan
Cc: Gavin Shan, lsf-pc@lists.linux-foundation.org, linux-mm
On Wed, Feb 22, 2023 at 10:31 AM Zi Yan wrote:
>
> On 22 Feb 2023, at 8:43, Pasha Tatashin wrote:
>
> > On Mon, Feb 20, 2023 at 6:51 PM Gavin Shan wrote:
> >>
> >> Hi Pasha,
> >>
> >> On 2/21/23 3:31 AM, Pasha Tatashin wrote:
> >>>
> >>> As part of ongoing work to replace some containerized workloads
> >>> with virtual machines within Google, I have been working on making
> >>> memory translations faster.
> >>>
> >>> I would like to propose the following topic for this year's LSF/MM/BPF:
> >>>
> >>> Discuss a set of techniques that can improve the guest performance,
> >>> memory footprint overhead, observability, and manageability of
> >>> virtual machines by hypervirtualizing the guest memory to the
> >>> extreme. The end goal is to allow very lightweight virtual machines
> >>> to be closer in performance to containers.
> >>>
> >>> The following items are going to be discussed in this topic:
> >>> - Reducing the cost of SLAT page table translations.
> >>> - Reducing the memory footprint overhead.
> >>> - Reducing the memory management overhead.
> >>> - Increasing the observability of guest memory.
> >>
> >> It's all about understanding the problem and possible solutions or
> >> directions.
> >>
> >> I googled 'SLAT' and it directed me to x86's EPT. ARM64 has a
> >> similar mechanism called the stage-2 page table. The usual way to
> >> reduce page table translation cost is to map contiguous memory
> >> through PUD/PMD entries. I'm not sure if there are other solutions
> >> we're heading for?
> >>
> >> A guest's memory is usually backed by a virtual memory area (VMA),
> >> which is either an anonymous or a hugetlb region. As I understand
> >> it, page fault handling is exercised heavily to populate the
> >> requested memory. I'm not sure whether reducing the memory
> >> management overhead means making that faster, or something else? :)
> >
> > Hi Gavin,
> >
> > In a non-virtualized environment, when converting a VA to a PA, we
> > load each level of the page table, so translating to a 4K page takes
> > 4 or 5 loads, depending on the page table type used. However, in a
> > virtualized environment, the number of loads to convert a guest VA
> > to a host PA is not the sum of the SLAT page table levels and the
> > guest page table levels; rather, with n guest levels and m SLAT
> > levels, it is n*m + n + m. This is because each guest page table
> > level must itself be translated from guest PA to host PA.
> > One way to minimize the number of loads is for the guest to use
> > huge pages, for example 1-Gbyte pages. However, this normally
> > wastes a lot of memory. The idea is that we can use guest physical
> > memory in a virtual way: create 1-Gbyte pages that are only
> > partially backed by host memory, yet still improve access
> > performance due to fewer TLB misses and faster translations through
> > the guest + SLAT page tables. I would like to discuss how this can
> > be achieved.
>
> Do you mean allocating 1GB pages in the guest and backing them using
> 2MB and/or 4KB pages in the host?

Yes, that is exactly right. However, we back only a subset of the 1GB
page: instead of zeroing the whole page on allocation or on the first
fault in the guest, the host faults memory in on demand as new parts of
the 1GB page are touched.

> From my understanding, for virtual machines, the TLB caches guest VA
> to host PA, so the number of TLB entries would be the same as using
> 2MB or 4KB pages in the guest (as long as the guest page and the host
> page backing it have the same size). What am I missing here?

Yes, the way the TLB works, the smaller of the host and guest page
sizes determines the TLB entry size. So 1GB guest pages with 2MB host
pages yield 2MB TLB entries, and 1GB guest pages with 4KB host pages
yield 4KB TLB entries. The saving comes from always having 1GB pages in
the guest: if the host backs them with 2MB pages, 2MB TLB entries are
used.

> For a TLB miss, it will be faster since fewer page table walks are
> needed for 1GB pages in the guest.

That is exactly right: the faster page table walk, or SLAT translation,
is what this approach achieves.

Thanks,
Pasha