From: Qu Wenruo
To: Shakeel Butt
Cc: linux-btrfs@vger.kernel.org, hannes@cmpxchg.org, mhocko@kernel.org,
    roman.gushchin@linux.dev, muchun.song@linux.dev,
    akpm@linux-foundation.org, cgroups@vger.kernel.org, linux-mm@kvack.org,
    Michal Hocko, "Vlastimil Babka (SUSE)"
Subject: Re: [PATCH] btrfs: root memcgroup for metadata filemap_add_folio()
Date: Tue, 1 Oct 2024 07:30:38 +0930
Message-ID: <08ccb40d-6261-4757-957d-537d295d2cf5@suse.com>

On 2024/10/1 02:53, Shakeel Butt wrote:
> Hi Qu,
>
> On Sat, Sep 28, 2024 at 02:15:56PM GMT, Qu Wenruo wrote:
>> [BACKGROUND]
>> The function filemap_add_folio() charges the memory cgroup,
>> as we assume all page caches are accessible by user space processes
>> and thus need the cgroup accounting.
>>
>> However btrfs is a special case, it has very large metadata thanks to
>> its support of data csum (by default it's 4 bytes per 4K data, and can
>> be as large as 32 bytes per 4K data).
>> This means btrfs has to use the page cache for its metadata pages, to
>> take advantage of both the caching and the reclaim ability of filemap.
>>
>> This has a tiny problem: all btrfs metadata pages have to go through
>> the memcgroup charge, even though those metadata pages are not
>> accessible by user space, and doing the charging can introduce some
>> latency if there is a memory limit set.
>>
>> Btrfs currently uses the __GFP_NOFAIL flag as a workaround for this
>> cgroup charge situation so that metadata pages won't really be limited
>> by memcgroup.
>>
>> [ENHANCEMENT]
>> Instead of relying on __GFP_NOFAIL to avoid charge failure, use root
>> memory cgroup to attach metadata pages.
>>
>> Although this needs to export the symbol mem_root_cgroup for
>> CONFIG_MEMCG, or define mem_root_cgroup as NULL for !CONFIG_MEMCG.
>>
>> With root memory cgroup, we directly skip the charging part, and only
>> rely on __GFP_NOFAIL for the real memory allocation part.
>>
>
> I have a couple of questions:
>
> 1. Were you using __GFP_NOFAIL just to avoid ENOMEMs? Are you ok with
> oom-kills?

The NOFAIL flag is inherited from the memory allocation for metadata tree
blocks.

Although btrfs already has error handling for all the possible ENOMEMs,
hitting ENOMEM for metadata may still be a big problem, thus all my
previous attempts to remove the NOFAIL flag got rejected.

>
> 2. What's the normal overhead of this metadata in a real world production
> environment? I see 4 to 32 bytes per 4k but what's the most used one and
> does it depend on the data of 4k or something else?

What did you mean by the "overhead" part? Did you mean the checksum?

If so, there is none, because btrfs stores the metadata checksum inside the
tree block (thus the page cache).
The first 32 bytes of a tree block are always reserved for the metadata
checksum.

The tree block size depends on the mkfs-time option nodesize, which is 16K
by default, and that's the most common value.

>
> 3. Most probably multiple metadata values are colocated on a single 4k
> page of the btrfs page cache even though the corresponding page cache
> might be charged to different cgroups. Is that correct?

Not always a single 4K page, it depends on the nodesize, which is 16K by
default.

Otherwise yes, the metadata page cache can be charged to different
cgroups, depending on the caller's context.
And we do not want to charge the metadata page cache to the caller's
cgroup, since it's really a shared resource and the caller has no way to
directly access the page cache.

Not charging the metadata page cache will align btrfs more closely with
ext4/xfs, which both use regular page allocation without attaching to a
filemap.

>
> 4. What is stopping us from using reclaimable slab cache for this metadata?

Josef has tried this before; the attempt failed on the shrinker part, and
partly due to the size.

Btrfs has very large metadata compared to all other fses, not only due to
the COW nature and a larger tree block size (16K by default), but also the
extra data checksum (4 bytes per 4K by default, 32 bytes per 4K maximum).

On a real world system, the metadata itself can easily reach hundreds of
GiBs, thus a shrinker is definitely needed.

Thus so far btrfs is using page cache for its metadata cache.

Thanks,
Qu

>
> thanks,
> Shakeel
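
To put the data checksum sizes mentioned above into perspective, here is a
rough sketch of the arithmetic. It only counts the raw checksum bytes (4 or
32 bytes per 4K of data) and ignores all tree overhead such as item headers
and COW copies, so real metadata usage would be higher. The helper name
data_csum_bytes() is made up for this illustration and is not a btrfs
interface.

```python
# Figures taken from the discussion above: checksums are kept per 4 KiB of
# data, 4 bytes each by default and 32 bytes each at most.
BLOCK_SIZE = 4 * 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def data_csum_bytes(data_bytes: int, csum_size: int,
                    block_size: int = BLOCK_SIZE) -> int:
    """Raw checksum bytes needed to cover data_bytes of file data.

    Hypothetical helper for illustration only; tree overhead is ignored.
    """
    blocks = -(-data_bytes // block_size)  # ceiling division
    return blocks * csum_size


if __name__ == "__main__":
    for csum_size in (4, 32):
        total = data_csum_bytes(TIB, csum_size)
        print(f"{csum_size}-byte csum: {total / GIB:.0f} GiB per TiB of data")
```

Under these assumptions, 1 TiB of data needs 1 GiB of raw checksums with
4-byte csums and 8 GiB with 32-byte csums, before any tree overhead.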