linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Kent Overstreet <kent.overstreet@linux.dev>
To: Eric Biggers <ebiggers@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>,
	Christoph Hellwig <hch@infradead.org>,
	linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-bcachefs@vger.kernel.org,
	Kent Overstreet <kent.overstreet@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	linux-mm@kvack.org
Subject: Re: [PATCH 07/32] mm: Bring back vmalloc_exec
Date: Fri, 12 May 2023 14:36:13 -0400	[thread overview]
Message-ID: <ZF6HHRDeUWLNtuL7@moria.home.lan> (raw)
In-Reply-To: <20230510064849.GC1851@quark.localdomain>

On Tue, May 09, 2023 at 11:48:49PM -0700, Eric Biggers wrote:
> What seems to be missing is any explanation for what we're actually getting from
> this extremely unusual solution that cannot be gained any other way.  What is
> unique about bcachefs that it really needs something like this?

Ok, as promised:

Background: all metadata in bcachefs is a structured as key/value pairs,
and there's a common key format for all keys.

struct bkey {
	/* 3 byte header */
	u8		u64s;		/* size of k/v in u64s */
	u8		format;		/* packed/unpacked, needs_whiteout */
	u8		type;		/* value type */
	u8		pad;

	/*
	 * Order of fields below is for little endian, they're in
	 * reverse order on big endian (and byte swabbed as necessary
	 * when reading foreign endian metadata)
	 * 
	 * Since field order matches byte order, the key can be treated
	 * as one large multi word integer for doing comparisons:
	 */
	u96		version;	/* nonces, send/recv support */
	u32		size;		/* size of extent keys */

	/* Below are the field used for ordering/comparison: */
	u32		snapshot;	
	u64		offset;
	u64		inode;

	/* Value is stored inline with key */
	struct bch_val	v;
};

sizeof(struct bkey) == 40.

An extent value that has one pointer and no checksum is 8 bytes, with
one pointer and one 32 bit checksum 16 bytes, for 56 bytes total (key
included).

But for a given btree node, most of the key fields will typically be
redundandant. An extents leaf node might have extents for all one inode
number or a small range of inode numbers, snapshots may or may not be in
use, etc. - clearly some compression is desirable here.

The key observation is that key compression is possible if we have a
compression function that preserves order, and an order-preserving
compression function is possible if it's allowed to fail. That means we
do comparisons on packed keys, which lets us skip _most_ unpack
operations, for btree node resorts and for lookups within a node.

Packing works by defining a format with an offset and a bit width for
each field, so e.g. if all keys in a btree node have the same inode
number the packed format can specify that inode number and then a field
width of 0 bits for the inode field.

Continuing the arithmetic from before, a packed extent key will
typically be only 8 or 16 bytes, or 24-32 including the val, which means
bkey packing cuts our metadata size roughly in half.

(It also makes our key format somewhat self describing and gives us a
mechanism by which we could add or extend fields in the future).

-----------------------------------------------------

As mentioned before, since packed bkeys are still multi-word integers we
can do some important operations without unpacking, but to iterate over
keys, compare packed & unpacked keys in resort, etc. - we'll still need
to unpack, so we need this operation to be as fast as possible.

bkey.c __bch2_bkey_unpack_key() is the unspecialized version written in
C, that works on any archictecture. It loops over the fields in a
bkey_format, pulling them out of the input words and adding back the
field offsets. It's got the absolute minimum number of branches - one
per field, when deciding to advance to the next input word - but it
can't be branchless and it's a whole ton of shifts and bitops.

dynamic codegen lets us produce unpack functions that are fully
branchless and _much_ smaller. For any given btree node we'll have a
format where multiple fields have 0 field with - i.e. those fields are
always constants. That code goes away, and also if the format can be
byte aligned we can eliminate shifts and bitopts. Code size for the
dynamically compiled unpack functions is roughly 10% that of the
unspecialized C version.

I hope that addresses some of the "what is this even for" questions :)

Cheers,
Kent


  reply	other threads:[~2023-05-12 18:36 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
2023-05-09 18:19   ` Lorenzo Stoakes
2023-05-09 20:15     ` Kent Overstreet
2023-05-09 20:46   ` Christoph Hellwig
2023-05-09 21:12     ` Lorenzo Stoakes
2023-05-09 21:29       ` Kent Overstreet
2023-05-10  6:48         ` Eric Biggers
2023-05-12 18:36           ` Kent Overstreet [this message]
2023-05-13  1:57             ` Eric Biggers
2023-05-13 19:28               ` Kent Overstreet
2023-05-14  5:45               ` Kent Overstreet
2023-05-14 18:43                 ` Eric Biggers
2023-05-15  5:38                   ` Kent Overstreet
2023-05-15  6:13                     ` Eric Biggers
2023-05-15  6:18                       ` Kent Overstreet
2023-05-15  7:13                         ` Eric Biggers
2023-05-15  7:26                           ` Kent Overstreet
2023-05-21 21:33                             ` Eric Biggers
2023-05-21 22:04                               ` Kent Overstreet
2023-05-15 10:29                 ` David Laight
2023-05-10 11:56         ` David Laight
2023-05-09 21:43       ` Darrick J. Wong
2023-05-09 21:54         ` Kent Overstreet
2023-05-11  5:33           ` Theodore Ts'o
2023-05-11  5:44             ` Kent Overstreet
2023-05-13 13:25       ` Lorenzo Stoakes
2023-05-14 18:39         ` Christophe Leroy
2023-05-14 23:43           ` Kent Overstreet
2023-05-15  4:45             ` Christophe Leroy
2023-05-15  5:02               ` Kent Overstreet
2023-05-10 14:18   ` Christophe Leroy
2023-05-10 15:05   ` Johannes Thumshirn
2023-05-11 22:28     ` Kees Cook
2023-05-12 18:41       ` Kent Overstreet
2023-05-16 21:02         ` Kees Cook
2023-05-16 21:20           ` Kent Overstreet
2023-05-16 21:47             ` Matthew Wilcox
2023-05-16 21:57               ` Kent Overstreet
2023-05-17  5:28               ` Kent Overstreet
2023-05-17 14:04                 ` Mike Rapoport
2023-05-17 14:18                   ` Kent Overstreet
2023-05-17 15:44                     ` Mike Rapoport
2023-05-17 15:59                       ` Kent Overstreet
2023-06-17  4:13             ` Andy Lutomirski
2023-06-17 15:34               ` Kent Overstreet
2023-06-17 19:19                 ` Andy Lutomirski
2023-06-17 20:08                   ` Kent Overstreet
2023-06-17 20:35                     ` Andy Lutomirski
2023-06-19 19:45                 ` Kees Cook
2023-06-20  0:39                   ` Kent Overstreet
2023-06-19  9:19   ` Mark Rutland
2023-06-19 10:47     ` Kent Overstreet
2023-06-19 12:47       ` Mark Rutland
2023-06-19 19:17         ` Kent Overstreet
2023-06-20 17:42           ` Andy Lutomirski
2023-06-20 18:08             ` Kent Overstreet
2023-06-20 18:15               ` Andy Lutomirski
2023-06-20 18:48                 ` Dave Hansen
2023-06-20 20:18                   ` Kent Overstreet
2023-06-20 20:42                   ` Andy Lutomirski
2023-06-20 22:32                     ` Andy Lutomirski
2023-06-20 22:43                       ` Nadav Amit
2023-06-21  1:27                         ` Andy Lutomirski
2023-06-15 20:41 ` [PATCH 00/32] bcachefs - a new COW filesystem Pavel Machek
2023-06-15 21:26   ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZF6HHRDeUWLNtuL7@moria.home.lan \
    --to=kent.overstreet@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=ebiggers@kernel.org \
    --cc=hch@infradead.org \
    --cc=kent.overstreet@gmail.com \
    --cc=linux-bcachefs@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lstoakes@gmail.com \
    --cc=urezki@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox