From: Qu Wenruo <wqu@suse.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-btrfs@vger.kernel.org, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	Michal Hocko <mhocko@suse.com>,
	"Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Subject: Re: [PATCH] btrfs: root memcgroup for metadata filemap_add_folio()
Date: Tue, 1 Oct 2024 11:33:44 +0930	[thread overview]
Message-ID: <54f0bbef-267b-48d9-ae09-0f3907d4fdc3@suse.com> (raw)
In-Reply-To: <7jmtrebounxuu44qgmc2y52bqlqdyuko7zp53p6iz6rkzmzzqg@m2csfnfbmv6c>



On 2024/10/1 11:07, Shakeel Butt wrote:
> On Tue, Oct 01, 2024 at 07:30:38AM GMT, Qu Wenruo wrote:
[...]
>>
>> Although btrfs already has error handling for all the possible ENOMEMs,
>> hitting ENOMEM for metadata can still be a big problem, thus all my
>> previous attempts to remove the NOFAIL flag got rejected.
> 
> __GFP_NOFAIL for memcg charging is reasonable in many scenarios. The memcg
> oom-killer is enabled for __GFP_NOFAIL, and going over the limit and getting
> oom-killed is totally reasonable. Orthogonal to the discussion, though.
> 
>>
>>>
>>> 2. What is the normal overhead of this metadata in real-world production
>>> environments? I see 4 to 32 bytes per 4k, but what's the most used value,
>>> and does it depend on the data in the 4k or on something else?
>>
>> What did you mean by the "overhead" part? Did you mean the checksum?
>>
> 
> To me this metadata is overhead, so yes: things like the checksum, rather
> than the actual data stored on the storage.

Oh, by "metadata" I mean everything that is not data.

It includes all the info like the directory layout, file layout, data 
checksums, and everything else needed to represent a btrfs filesystem.
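
To put a rough number on the checksum part (illustrative arithmetic based 
on the defaults mentioned below, not a measurement): the default is 4 bytes 
of checksum per 4K of data, i.e. 4 / 4096 ~= 0.1% of the data size, so 
about 1GiB of checksum items per TiB of data.  With the largest checksum 
(32 bytes per 4K) that becomes 32 / 4096 ~= 0.8%, or about 8GiB per TiB, 
before even counting the other trees.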

> 
>> If so, there is none, because btrfs stores the metadata checksum inside the
>> tree block (thus inside the page cache).
>> The first 32 bytes of a tree block are always reserved for the metadata
>> checksum.
>>
>> The tree block size depends on the mkfs-time option nodesize; it is 16K by
>> default, and that's the most common value.
> 
> Sorry I am not very familiar with btrfs. What is a tree block?

A tree block in btrfs is a fixed-size block containing metadata (i.e., 
everything other than data), organized in a B-tree structure.

A tree block can be a node, containing pointers to the next-level 
nodes/leaves, or a leaf, containing keys and the extra info bound to 
each key.

And btrfs uses the same tree block structure for all the different kinds 
of info.

E.g. an inode is stored with (<ino> INODE_ITEM 0) as the key, with a 
btrfs_inode_item structure as the extra data bound to that key.

And a file extent is stored with (<ino> EXTENT_DATA <file pos>) as the 
key, with a btrfs_file_extent_item structure bound to that key.
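
To make the key/item pairing concrete, here is a minimal sketch of the 
on-disk key (simplified from include/uapi/linux/btrfs_tree.h; the example 
values and comments are illustrative, not exhaustive):

struct btrfs_disk_key {
	__le64 objectid;	/* e.g. the inode number */
	__u8   type;		/* item type, e.g. BTRFS_INODE_ITEM_KEY */
	__le64 offset;		/* meaning depends on type, e.g. file pos */
} __attribute__ ((__packed__));

/*
 * In a leaf, each key is paired (via an item header) with its item body:
 *   (256 INODE_ITEM 0)     -> struct btrfs_inode_item (size, uid, ...)
 *   (256 EXTENT_DATA 4096) -> struct btrfs_file_extent_item
 *                             (disk bytenr, num_bytes, compression, ...)
 */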

> 
>>
>>>
>>> 3. Most probably multiple metadata values are colocated on a single 4k
>>> page of the btrfs page cache even though the corresponding page cache
>>> might be charged to different cgroups. Is that correct?
>>
>> Not always a single 4K page; it depends on the nodesize, which is 16K by
>> default.
>>
>> Otherwise yes, the metadata page cache can be charged to different cgroups,
>> depending on the caller's context.
>> And we do not want to charge the metadata page cache to the caller's cgroup,
>> since it's really a shared resource and the caller has no way to directly
>> access the page cache.
>>
>> Not charging the metadata page cache would align btrfs more with ext4/xfs,
>> which both use regular page allocation without attaching to a filemap.
>>
> 
> Can you point me to ext4/xfs code where they are allocating uncharged
> memory for their metadata?

For xfs, it's inside fs/xfs/xfs_buf.c.
E.g. xfs_buf_alloc_pages(), which uses kzalloc() to allocate the needed 
pages.

For ext4 it's using buffer heads, which I'm not familiar with at all.
But it looks like the bh folios come from the block device mapping, which 
may still be charged to a cgroup.
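
The underlying distinction, as a simplified sketch (not the exact call 
chains): filemap_add_folio() charges the folio through mem_cgroup_charge() 
internally, while a bare page allocation is only accounted to a memcg when 
__GFP_ACCOUNT is passed:

#include <linux/gfp.h>
#include <linux/mm.h>

/* Not charged to any memcg: plain allocation, no __GFP_ACCOUNT,
 * and never inserted into a charged page cache mapping. */
static struct page *meta_page_uncharged(void)
{
	return alloc_page(GFP_NOFS);
}

/* The same allocation, but accounted to the current task's memcg. */
static struct page *meta_page_charged(void)
{
	return alloc_page(GFP_NOFS | __GFP_ACCOUNT);
}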

Thanks,
Qu

> 
>>>
>>> 4. What is stopping us from using a reclaimable slab cache for this metadata?
>>
>> Josef tried this before; the attempt failed partly on the shrinker side
>> and partly due to the size.
>>
>> Btrfs has very large metadata compared to all other fses, not only due to
>> its COW nature and a larger tree block size (16K by default), but also due
>> to the extra data checksum (4 bytes per 4K by default, 32 bytes per 4K
>> maximum).
>>
>> On a real-world system, the metadata itself can easily reach hundreds of
>> GiBs, thus a shrinker is definitely needed.
> 
> This amount of uncharged memory is concerning; it becomes part of the
> system overhead and may impact the schedulable memory in a datacenter
> environment.
> 
> Overall the code seems fine and no pushback from me if the btrfs
> maintainers are ok with this. I think btrfs should move to a slab+shrinker
> based solution for this metadata unless there is a deep technical reason
> not to.
> 
> thanks,
> Shakeel


