From: Pasha Tatashin <pasha.tatashin@soleen.com>
Date: Tue, 21 Feb 2023 12:16:27 -0500
Subject: Re: [LSF/MM/BPF TOPIC] Single Owner Memory
To: Matthew Wilcox
Cc: lsf-pc@lists.linux-foundation.org, linux-mm
Content-Type: text/plain; charset="UTF-8"
On Tue, Feb 21, 2023 at 10:05 AM Matthew Wilcox wrote:
>
> On Tue, Feb 21, 2023 at 09:37:17AM -0500, Pasha Tatashin wrote:
> > Hey Matthew,
> >
> > Thank you for looking into this.
> >
> > On Tue, Feb 21, 2023 at 8:46 AM Matthew Wilcox wrote:
> > >
> > > On Mon, Feb 20, 2023 at 02:10:24PM -0500, Pasha Tatashin wrote:
> > > > Within Google the vast majority of memory, over 90%, has a single
> > > > owner. This is because most of the jobs are not multi-process but
> > > > instead multi-threaded. Examples of single owner memory
> > > > allocations are all tcmalloc()/malloc() allocations, and
> > > > mmap(MAP_ANONYMOUS | MAP_PRIVATE) allocations without forks. On
> > > > the other hand, the struct page metadata that is shared for all
> > > > types of memory takes 1.6% of system memory. It would be
> > > > reasonable to find ways to optimize memory such that the common
> > > > som case has a reduced amount of metadata.
> > > >
> > > > This would be similar to HugeTLB and DAX, which are treated as
> > > > special cases and can release struct pages for the subpages back
> > > > to the system.
> > >
> > > DAX can't, unless something's changed recently. You're referring to
> > > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
> >
> > DAX has a similar optimization:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.2&id=e3246d8f52173a798710314a42fea83223036fc8
>
> Oh, devdax, not fsdax.
> > > > The proposal is to discuss a new som driver that would use
> > > > HugeTLB as a source of 2M chunks. When a user creates som
> > > > memory, i.e.:
> > > >
> > > > mmap(MAP_ANONYMOUS | MAP_PRIVATE);
> > > > madvise(mem, length, MADV_DONTFORK);
> > > >
> > > > A vma from the som driver is used instead of a regular anon vma.
> > >
> > > That's going to be "interesting". The VMA is already created with
> > > the call to mmap(), and madvise has not traditionally allowed
> > > drivers to replace a VMA. You might be better off creating a
> > > /dev/som and hacking the malloc libraries to pass an fd from that
> > > instead of passing MAP_ANONYMOUS.
> >
> > I do not plan to replace the VMA after madvise(); I listed the
> > syscall sequence to show how Single Owner Memory can be enforced
> > today. However, in the future we either need to add another mmap()
> > flag for single owner memory, if that proves to be important, or, as
> > you suggested, use ioctl() through /dev/som.
>
> Not ioctl(). Pass an fd from /dev/som to mmap and have the som driver
> set up the VMA.

Good point, using an fd is indeed better, and can be made accessible to
more users without changes.

> > > > The discussion should include the following topics:
> > > > - Interaction with folio and the proposed struct page {memdesc}.
> > > > - Handling for migrate_pages() and friends.
> > > > - Handling for FOLL_PIN and FOLL_LONGTERM.
> > > > - What type of madvise() properties the som memory should handle
> > >
> > > Obviously once we get to dynamically allocated memdescs, this whole
> > > thing goes away, so I'm not excited about making big changes to the
> > > kernel to support this.
> >
> > This is why the changes that I am thinking about are going to be
> > mostly localized in a separate driver and do not alter the core mm
> > much.
> > However, even with memdescs, today the Single Owner Memory is not
> > singled out from the rest of the memory types (shared, anon, named),
> > so I do not expect that memdescs can provide savings or
> > optimizations for this specific use case.
>
> With memdescs, let's suppose the malloc library asks for a 256kB
> allocation. You end up using 8 bytes per page for the memdesc pointer
> (512 bytes) plus around 96 bytes for the folio that's used by the anon
> memory (assuming appropriate hinting / heuristics that says "Hey,
> treat this as a single allocation").

Also, the 256kB would have to be physically contiguous, right?
Hopefully, fragmentation is not going to be an issue, but we might need
to look into strengthening page migration enforcement in order to reduce
fragmentation during allocations, and thus reduce the memory overhead.
Today, fragmentation can reduce performance when THPs are not available,
but in the future with memdescs, fragmentation might also affect the
memory overhead. We might need to look into changing some of the
migration policies.

> So that's 608 bytes of overhead for a 256kB allocation, or 0.23%
> overhead. About half the overhead of 8kB per 2MB (plus whatever
> overhead the SOM driver has to track the 256kB of memory).

I like the idea of memdescs, and would like to stay involved in the
project's development. The potential memory savings are indeed
substantial.

> If 256kB isn't the right size to be doing this kind of analysis on, we
> can rerun it on whatever size you want. I'm not really familiar with
> what userspace is doing these days.
>
> > > The savings you'll see are 6 pages (24kB) per 2MB allocated
> > > (1.2%). That's not nothing, but it's not huge either.
> >
> > This depends on the scale; in our fleet 1.2% savings are huge.
>
> Then 1.4% will be better, yes? ;-)

Absolutely, 1.4% is even better.
I mean, 0% kernel memory overhead would be just about perfect :-)

Let me provide a few more reasons why /dev/som can be helpful:

1. Independent memory pool.
While /dev/som itself always manages memory in 2M chunks, it can be
configured to use memory from HugeTLB (2M or 1G), devdax, or
kernel-external memory (i.e. memory that is not part of System RAM).

2. Low overhead.
/dev/som will allocate memory from the pool in 1G chunks, and manage it
in 2M chunks. This allows low-overhead memory management via bitmaps.
Lists/trees of 2M chunks are going to be kept per user process, from
which the faults on som vmas are going to be handled.

3. All pages are migratable.
Since som manages only user pages, all pages are going to be required to
be migratable. In order to support FOLL_LONGTERM we will need to decide
either to migrate the page to become a normal page (i.e. Core-MM
managed), or to add a separate pool of long-term pinned pages. Even in
today's kernel, when we FOLL_LONGTERM a page it is migrated out of
ZONE_MOVABLE.

4. 1G anonymous pages.
Since all pages are migratable, support for 1G anonymous pages can be
implemented. Unlike with Core-MM, where THPs do not have struct page
optimizations, the som 4k, 2M, and 1G pages will all have reasonable
overhead from the beginning.

5. Performance benefit for running /dev/som in a virtual machine.
The way extended page tables work is that the translation cost, in terms
of the number of loads, is not a simple sum of the native page table and
the extended page table, but actually (n * m + n + m), where n is the
number of page table levels in the guest, and m is the number of page
table levels in the extended page table. This is because the guest page
table levels themselves must be translated into host physical addresses
using the extended page tables. Since /dev/som allows for 1G anonymous
pages, we can use guest physical memory as virtual memory: i.e.
only a subset of the 1G page in the guest is actually backed by physical
pages on the host, yet access to that subset is going to be
substantially faster due to fewer page table loads and a lower TLB miss
rate. I am proposing to have a separate talk about this and other VM
optimizations:
https://lore.kernel.org/linux-mm/CA+CK2bDr5Xii021JBXeyCEY4jjWCsZQ=ENa-s8MLkBv5hYUvsA@mail.gmail.com/

6. Security.
There is a reduced risk of falsely sharing pages, because they are
enforced to be single owner pages. This can help avoid some of the bugs
that we have seen in the past with refcount errors, for which I wrote
page_table_check, which has since caught a few false sharing issues.