Date: Thu, 18 Jul 2024 10:28:18 +0200
Subject: Re: [PATCH 0/2] mm: skip memcg for certain address space
From: "Vlastimil Babka (SUSE)"
To: Qu Wenruo, Qu Wenruo, Michal Hocko
Cc: linux-btrfs@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, Johannes Weiner, Roman Gushchin,
 Shakeel Butt, Muchun Song, Cgroups, Matthew Wilcox
In-Reply-To: <3cc3e652-e058-4995-8347-337ae605ebab@suse.com>
References: <8faa191c-a216-4da0-a92c-2456521dcf08@kernel.org>
 <9c0d7ce7-b17d-4d41-b98a-c50fd0c2c562@gmx.com>
 <9572fc2b-12b0-41a3-82dc-bb273bfdd51d@kernel.org>
 <3cc3e652-e058-4995-8347-337ae605ebab@suse.com>

On 7/18/24 9:52 AM, Qu Wenruo wrote:
> On 2024/7/18 16:47, Vlastimil Babka (SUSE) wrote:
>> On 7/18/24 12:38 AM, Qu Wenruo wrote:
> [...]
>>> Another question is, I only see this hang with a larger folio (order 2
>>> vs the old order 0) when adding to the same address space.
>>>
>>> Does the folio order have anything to do with the problem, or does a
>>> higher order just make it more likely?
>>
>> I didn't spot anything in the memcg charge path that would depend on the
>> order directly, hm. Also what kernel version was showing these soft
>> lockups?
>
> The previous rc kernel, IIRC v6.10-rc6.
>
> But that needs extra btrfs patches; otherwise btrfs is still only doing
> the order-0 allocation and then adding the order-0 folio into the filemap.
>
> The extra patch just directs btrfs to allocate an order-2 folio (matching
> the default 16K nodesize), then attach the folio to the metadata filemap.
>
> Plus extra code handling corner cases like different folio sizes etc.
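
Just to make sure we're talking about the same path: I imagine the patched
allocation side looks roughly like the sketch below. This is purely my own
guess from your description, not your actual code, so the helper name, the
gfp flags and the error handling are assumptions (btrfs may well also pass
__GFP_NOFAIL here, given the GFP_NOFAIL discussion below):

	/* Hypothetical sketch: allocate an order-2 folio (16K with 4K
	 * pages, matching the default nodesize) and insert it into the
	 * metadata inode's page cache.  The memcg charge happens inside
	 * filemap_add_folio().  Needs <linux/pagemap.h> and <linux/err.h>.
	 */
	static struct folio *alloc_metadata_folio(struct address_space *mapping,
						  pgoff_t index)
	{
		struct folio *folio;
		int ret;

		folio = filemap_alloc_folio(GFP_NOFS, 2);
		if (!folio)
			return ERR_PTR(-ENOMEM);

		ret = filemap_add_folio(mapping, folio, index, GFP_NOFS);
		if (ret) {
			folio_put(folio);
			return ERR_PTR(ret);
		}
		return folio;
	}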
Hm right, but the same code is triggered for high-order folios (at least
for user-mappable page cache) today by some filesystems AFAIK, so we should
be seeing such lockups already? The btrfs case might be special in that
it's for the internal nodes as you explain, but that makes no difference
for filemap_add_folio(), right? Or is it the only user with GFP_NOFS? Also,
is that passed as gfp directly, or are there some extra scoped gfp
restrictions involved (memalloc_..._save())? A sketch of what I mean by
that is at the end of this mail.

>>
>>> And finally, even without the hang problem, does it make any sense to
>>> skip all the possible memcg charge completely, either to reduce latency
>>> or just to reduce GFP_NOFAIL usage, for those user-inaccessible inodes?
>>
>> Is it common to even use the filemap code for such metadata that can't
>> really be mapped to userspace?
>
> At least XFS/EXT4 don't use the filemap to handle their metadata. One of
> the reasons is that btrfs has pretty large metadata structures.
> Not only for the regular filesystem things, but also data checksums.
>
> Even using the default CRC32C algo, it's 4 bytes per 4K of data.
> Thus things can go crazy pretty easily, and that's the reason why btrfs
> is still sticking to the filemap solution.
>
>> How does it even interact with reclaim, do they
>> become part of the page cache and are scanned by reclaim together with
>> data that is mapped?
>
> Yes, it's handled just like all other filemaps; it's also using the page
> cache and all the lru/scanning things.
>
> The major difference is, we only implement a small subset of the address
> space operations:
>
> - write
> - release
> - invalidate
> - migrate
> - dirty (debug only, otherwise falls back to filemap_dirty_folio())
>
> Note there is no read operation, as it's btrfs itself triggering the
> metadata read, thus there is no read/readahead.
> Thus we're in full control of the page cache, e.g. we determine the
> folio size to be added into the filemap.
>
> The filemap infrastructure provides 2 good functionalities:
>
> - (Page) Cache
>   So that we can easily determine if we really need to read from the
>   disk, and this can save us a lot of random IOs.
>
> - Reclaiming
>
> And of course the page cache of the metadata inode won't be
> cloned/shared to any user-accessible inode.
>
>> How are the lru decisions handled if there are no references
>> from PTE access bits? Or can they even be reclaimed, or is reclaim
>> impossible because there may e.g. be other open inodes pinning this
>> metadata?
>
> If I understand it correctly, we have implemented the release_folio()
> callback, which does the btrfs metadata checks to determine if we can
> release the current folio, and avoids releasing folios that are still
> under IO etc.

I see, thanks. Sounds like there might be some suboptimal handling in that
the folio will appear inactive, because there are no references that
folio_check_references() can detect, unless there are some
folio_mark_accessed() calls involved (I see some FGP_ACCESSED in btrfs, so
maybe that's fine enough). So reclaim could consider it often, only to be
stopped by release_folio failing.

>>
>> (sorry if the questions seem noob, I'm not that much familiar with the
>> page cache side of mm)
>
> No worry at all, I'm also a newbie on the whole mm part.
>
> Thanks,
> Qu
>
>>
>>> Thanks,
>>> Qu
>>
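
P.S. Regarding the scoped gfp restrictions I mentioned above, I mean
something like the following generic pattern (just an illustration of the
mechanism, not a claim about what btrfs actually does; the mapping, folio
and index variables are placeholders):

	/* Needs <linux/sched/mm.h> and <linux/pagemap.h>.  Within the
	 * NOFS scope, the page allocator and reclaim treat any gfp as if
	 * __GFP_FS was cleared, even if the caller passes GFP_KERNEL, so
	 * reclaim triggered by e.g. the memcg charge in filemap_add_folio()
	 * won't recurse into the filesystem. */
	unsigned int nofs_flags;
	int ret;

	nofs_flags = memalloc_nofs_save();
	ret = filemap_add_folio(mapping, folio, index, GFP_KERNEL);
	memalloc_nofs_restore(nofs_flags);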