Message-ID: <54f0bbef-267b-48d9-ae09-0f3907d4fdc3@suse.com>
Date: Tue, 1 Oct 2024 11:33:44 +0930
Subject: Re: [PATCH] btrfs: root memcgroup for metadata filemap_add_folio()
From: Qu Wenruo
To: Shakeel Butt
Cc: linux-btrfs@vger.kernel.org, hannes@cmpxchg.org, mhocko@kernel.org, roman.gushchin@linux.dev, muchun.song@linux.dev, akpm@linux-foundation.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Michal Hocko, "Vlastimil Babka (SUSE)"
References: <08ccb40d-6261-4757-957d-537d295d2cf5@suse.com> <7jmtrebounxuu44qgmc2y52bqlqdyuko7zp53p6iz6rkzmzzqg@m2csfnfbmv6c>

On 2024/10/1 11:07, Shakeel Butt wrote:
> On Tue, Oct 01, 2024 at 07:30:38AM GMT, Qu Wenruo wrote:
[...]
>>
>> Although btrfs has error handling already for all the possible ENOMEMs,
>> hitting ENOMEMs for metadata may still be a big problem, thus all my
>> previous attempt to remove NOFAIL flag all got rejected.
>
> __GFP_NOFAIL for memcg charging is reasonable in many scenarios.
> Memcg oom-killer is enabled for __GFP_NOFAIL and going over limit and getting
> oom-killed is totally reasonable. Orthogonal to the discussion though.
>
>>
>>>
>>> 2. What the normal overhead of these metadata in real world production
>>> environment? I see 4 to 32 bytes per 4k but what's the most used one and
>>> does it depend on the data of 4k or something else?
>>
>> What did you mean by the "overhead" part? Did you mean the checksum?
>>
>
> To me this metadata is overhead, so yes checksum is something not the
> actual data stored on the storage.

Oh, by "metadata" I mean everything that is not data.

It includes all the info like directory layout, file layout, data
checksums and all the other info needed to represent a btrfs filesystem.

>
>> If so, there is none, because btrfs store metadata checksum inside the tree
>> block (thus the page cache).
>> The first 32 bytes of a tree block are always reserved for metadata
>> checksum.
>>
>> The tree block size depends on the mkfs time option nodesize, is 16K by
>> default, and that's the most common value.
>
> Sorry I am not very familiar with btrfs. What is tree block?

A btrfs tree block is a fixed-size block containing metadata (aka
everything other than the data), organized in a B-tree structure.

A tree block can be a node, containing pointers to the next-level
nodes/leaves, or a leaf, containing keys and the extra info bound to
each key.

And btrfs uses the same tree block structure for all the different kinds
of info.

E.g. an inode is stored with ( INODE_ITEM 0) as the key, with a
btrfs_inode_item structure as the extra data bound to that key.

And a file extent is stored with ( EXTENT_DATA ) as the key, with a
btrfs_file_extent_item structure bound to that key.

>
>>
>>>
>>> 3. Most probably multiple metadata values are colocated on a single 4k
>>> page of the btrfs page cache even though the corresponding page cache
>>> might be charged to different cgroups. Is that correct?
>>
>> Not always a single 4K page, it depends on the nodesize, which is 16K by
>> default.
>>
>> Otherwise yes, the metadata page cache can be charged to different cgroup,
>> depending on the caller's context.
>> And we do not want to charge the metadata page cache to the caller's cgroup,
>> since it's really a shared resource and the caller has no way to directly
>> accessing the page cache.
>>
>> Not charging the metadata page cache will align btrfs more to the ext4/xfs,
>> which all uses regular page allocation without attaching to a filemap.
>>
>
> Can you point me to ext4/xfs code where they are allocating uncharged
> memory for their metadata?

For xfs, it's inside fs/xfs/xfs_buf.c, e.g. xfs_buf_alloc_pages(), which
uses kzalloc() to allocate the needed pages.

For ext4, it's using buffer heads, which I'm not familiar with at all.
But it looks like the bh folios come from the block device mapping,
which may still be charged to a cgroup.

Thanks,
Qu

>
>>>
>>> 4. What is stopping us to use reclaimable slab cache for this metadata?
>>
>> Josef has tried this before, the attempt failed on the shrinker part, and
>> partly due to the size.
>>
>> Btrfs has very large metadata compared to all other fses, not only due to
>> the COW nature and a larger tree block size (16K by default), but also the
>> extra data checksum (4 bytes per 4K by default, 32 bytes per 4K maximum).
>>
>> On a real world system, the metadata itself can easily go hundreds of GiBs,
>> thus a shrinker is definitely needed.
>
> This amount of uncharged memory is concerning which becomes part of
> system overhead and may impact the schedulable memory for the datacenter
> environment.
>
> Overall the code seems fine and no pushback from me if btrfs maintainers
> are ok with this. I think btrfs should move to slab+shrinker based
> solution for this metadata unless there is deep technical reason not to.
>
> thanks,
> Shakeel
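
For a rough feel of the data checksum sizes quoted above (4 bytes per 4K by default, 32 bytes per 4K maximum), the arithmetic can be sketched in Python. This is only a back-of-the-envelope illustration: the helper name csum_bytes() and the constants are made up for this example and are not btrfs APIs, and the result ignores tree block headers, item overhead and COW copies, so the real metadata footprint would be larger.

```python
# Back-of-the-envelope sketch; csum_bytes() is a hypothetical helper,
# not part of btrfs or any library.
SECTOR_SIZE = 4096  # the "4K" data block size used in the thread
TIB = 1 << 40
GIB = 1 << 30


def csum_bytes(data_bytes: int, csum_size: int, sector_size: int = SECTOR_SIZE) -> int:
    """Return the bytes of data checksum needed to cover data_bytes of file data."""
    # Every sector gets one checksum, so round up to whole sectors.
    sectors = -(-data_bytes // sector_size)
    return sectors * csum_size


if __name__ == "__main__":
    # 4 bytes per 4K is the default, 32 bytes per 4K the maximum (per the thread).
    for csum_size in (4, 32):
        total = csum_bytes(TIB, csum_size)
        print(f"{csum_size} bytes per 4K: {total / GIB:.0f} GiB of checksums per TiB of data")
```

Under these assumptions, 1 TiB of data needs 1 GiB of checksums at 4 bytes per 4K and 8 GiB at 32 bytes per 4K, before counting any of the other metadata.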