From: Qu Wenruo <wqu@suse.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-btrfs@vger.kernel.org, hannes@cmpxchg.org,
	mhocko@kernel.org, roman.gushchin@linux.dev,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	cgroups@vger.kernel.org, linux-mm@kvack.org,
	Michal Hocko <mhocko@suse.com>,
	"Vlastimil Babka (SUSE)" <vbabka@kernel.org>
Subject: Re: [PATCH] btrfs: root memcgroup for metadata filemap_add_folio()
Date: Tue, 1 Oct 2024 07:30:38 +0930	[thread overview]
Message-ID: <08ccb40d-6261-4757-957d-537d295d2cf5@suse.com> (raw)
In-Reply-To: <iwjlzsphxhqdpml5gn3t3qt5zhizgcmizel5vug7g7bwlkzeob@g2jlar2nynqb>



On 2024/10/1 02:53, Shakeel Butt wrote:
> Hi Qu,
> 
> On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote:
>> [BACKGROUND]
>> The function filemap_add_folio() charges the memory cgroup, as we
>> assume all page cache is accessible by user space processes and thus
>> needs cgroup accounting.
>>
>> However btrfs is a special case: it has very large metadata, thanks to
>> its support of data checksums (by default 4 bytes per 4K of data, and
>> up to 32 bytes per 4K of data).
>> This means btrfs has to use the page cache for its metadata pages, to
>> take advantage of both the caching and reclaim abilities of the filemap.
>>
>> This has a small problem: all btrfs metadata pages have to go through
>> the memcg charge, even though those metadata pages are not accessible
>> from user space, and the charging can introduce extra latency if a
>> memory limit is set.
>>
>> Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this
>> cgroup charge situation, so that metadata pages won't really be limited
>> by the memcg.
>>
>> [ENHANCEMENT]
>> Instead of relying on __GFP_NOFAIL to avoid charge failures, attach
>> metadata pages to the root memory cgroup.
>>
>> This requires exporting the symbol mem_root_cgroup for CONFIG_MEMCG,
>> or defining mem_root_cgroup as NULL for !CONFIG_MEMCG.
>>
>> With the root memory cgroup, we skip the charging part entirely, and
>> only rely on __GFP_NOFAIL for the actual memory allocation.
>>
> 
> I have a couple of questions:
> 
> 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with
> oom-kills?

The NOFAIL flag is inherited from the memory allocation for metadata 
tree blocks.

Although btrfs already has error handling for all the possible ENOMEMs, 
hitting ENOMEM for metadata may still be a big problem, so all my 
previous attempts to remove the NOFAIL flag were rejected.

> 
> 2. What is the normal overhead of this metadata in a real world production
> environment? I see 4 to 32 bytes per 4k, but what's the most commonly used
> value, and does it depend on the 4k of data or on something else?

What did you mean by the "overhead" part? Did you mean the checksum?

If so, there is none, because btrfs stores the metadata checksum inside 
the tree block itself (and thus inside the page cache).
The first 32 bytes of a tree block are always reserved for the metadata 
checksum.

The tree block size depends on the mkfs-time option nodesize; it is 16K 
by default, and that's the most common value.
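
For reference, the checksum lives at the very start of each tree block. 
A simplified sketch of the on-disk header (field names follow struct 
btrfs_header, but the later fields are abbreviated, so treat this as an 
illustration rather than the full definition):

	/* Simplified view of the start of a btrfs tree block. */
	struct btrfs_header {
		u8     csum[32];   /* metadata checksum, always the first 32 bytes */
		u8     fsid[16];   /* filesystem UUID */
		__le64 bytenr;     /* logical address of this tree block */
		__le64 flags;
		/* ... chunk tree UUID, generation, owner, nritems, level ... */
	} __attribute__ ((__packed__));

Everything after the header, up to nodesize, is keys and items (for 
leaves) or key pointers (for internal nodes).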

> 
> 3. Most probably multiple metadata values are colocated on a single 4k
> page of the btrfs page cache even though the corresponding page cache
> might be charged to different cgroups. Is that correct?

It's not always a single 4K page; it depends on the nodesize, which is 
16K by default.

Otherwise yes, the metadata page cache can be charged to different 
cgroups, depending on the caller's context.
And we do not want to charge the metadata page cache to the caller's 
cgroup, since it's really a shared resource and the caller has no way of 
directly accessing the page cache.

Not charging the metadata page cache would also align btrfs more closely 
with ext4/xfs, which both use regular page allocations without attaching 
them to a filemap.
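
Just to illustrate the idea (a rough conceptual sketch, not the actual 
patch; it assumes root_mem_cgroup / mem_root_cgroup is made visible to 
btrfs, which is exactly what the patch has to export, and the gfp flags 
are just the usual metadata ones): filemap_add_folio() charges with a 
NULL mm, so, if I read the memcg code correctly, a temporarily set 
"active" memcg is the one that gets charged, and charging the root memcg 
is effectively a no-op as far as limits go:

	/*
	 * Conceptual sketch only.  mapping/folio/index stand for the
	 * btree inode's mapping, the new extent buffer folio and its
	 * index; root_mem_cgroup is assumed to be visible here.
	 */
	struct mem_cgroup *old_memcg;
	int ret;

	old_memcg = set_active_memcg(root_mem_cgroup);
	ret = filemap_add_folio(mapping, folio, index,
				GFP_NOFS | __GFP_NOFAIL);
	set_active_memcg(old_memcg);

Whether it is done this way or by passing the memcg down explicitly does 
not matter much for the argument; the point is that the charge ends up 
on the root cgroup instead of whichever cgroup happens to be running the 
btrfs code at that moment.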

> 
> 4. What is stopping us from using a reclaimable slab cache for this metadata?

Josef tried this before; the attempt failed partly on the shrinker side 
and partly due to the sheer size of the metadata.

Btrfs has very large metadata compared to all other filesystems, not 
only due to its COW nature and larger tree block size (16K by default), 
but also due to the extra data checksums (4 bytes per 4K by default, 32 
bytes per 4K maximum).

On a real-world system the metadata itself can easily reach hundreds of 
GiB, so a shrinker is definitely needed.
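
As a rough back-of-the-envelope number, counting only the raw checksum 
bytes (the csum tree's own b-tree overhead comes on top of this):

	1 TiB of data / 4 KiB per block  = 268,435,456 blocks
	268,435,456 blocks *  4 bytes    = 1 GiB of csums (default crc32c)
	268,435,456 blocks * 32 bytes    = 8 GiB of csums (e.g. sha256)

So a filesystem in the tens or hundreds of TiB range carries tens to 
hundreds of GiB of csum data alone, before counting the extent, 
subvolume and other trees.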

That is why, so far, btrfs uses the page cache for its metadata cache.

Thanks,
Qu

> 
> thanks,
> Shakeel



