Date: Thu, 5 Sep 2024 08:34:39 +1000
From: Dave Chinner
To: Kent Overstreet
Cc: Michal Hocko, Andrew Morton, Christoph Hellwig, Yafang Shao,
	jack@suse.cz, Vlastimil Babka, Dave Chinner, Christian Brauner,
	Alexander Viro, Paul Moore, James Morris, "Serge E. Hallyn",
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-bcachefs@vger.kernel.org,
	linux-security-module@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/2 v2] remove PF_MEMALLOC_NORECLAIM
References: <20240902095203.1559361-1-mhocko@kernel.org>
	<20240902145252.1d2590dbed417d223b896a00@linux-foundation.org>

On Wed, Sep 04, 2024 at 02:03:13PM -0400, Kent Overstreet wrote:
> On Wed, Sep 04, 2024 at 06:46:00PM GMT, Michal Hocko wrote:
> > On Wed 04-09-24 12:05:56, Kent Overstreet wrote:
> > > But it seems to me that the limit should be lower if you're on e.g. a 2
> > > GB machine (not failing with a warning, just failing immediately rather
> > > than oom killing a bunch of stuff first) - and it's going to need to be
> > > raised above INT_MAX as large memory machines keep growing, I keep
> > > hitting it in bcachefs fsck code.
> >
> > Do we actual usecase that would require more than couple of MB? The
> > amount of memory wouldn't play any actual role then.
>
> Which "amount of memory?" - not parsing that.
>
> For large allocations in bcachefs: in journal replay we read all the
> keys in the journal, and then we create a big flat array with references
> to all of those keys to sort and dedup them.
>
> We haven't hit the INT_MAX size limit there yet, but filesystem sizes
> being what they are, we will soon. I've heard of users with 150 TB
> filesystems, and once the fsck scalability issues are sorted we'll be
> aiming for petabytes.
> Dirty keys in the journal scales more with system
> memory, but I'm leasing machines right now with a quarter terabyte of
> ram.

I've seen xfs_repair require a couple of TB of RAM to repair
metadata-heavy filesystems of relatively small size (sub-20TB). Once
you get above a few hundred GB of metadata in the filesystem, the
fsck cross-reference data set size can easily run into the TBs.

So 256GB might *seem* like a lot of memory, but we were seeing
xfs_repair exceed that amount of RAM for metadata-heavy filesystems
at least a decade ago...

Indeed, we recently heard about a 6TB filesystem with 15 *billion*
hardlinks in it. The cross reference for resolving all those
hardlinks would require somewhere in the order of 1.5TB of RAM to
hold. The only way to reliably handle random access data sets this
large is with pageable memory....

> Another more pressing one is the extents -> backpointers and
> backpointers -> extents passes of fsck; we do a linear scan through one
> btree checking references to another btree. For the btree we're checking
> references to the lookups are random, so we need to cache and pin the
> entire btree in ram if possible, or if not whatever will fit and we run
> in multiple passes.
>
> This is the #1 scalability issue hitting a number of users right now, so
> I may need to rewrite it to pull backpointers into an eytzinger array
> and do our random lookups for backpointers on that - but that will be
> "the biggest vmalloc array we can possible allocate", so the INT_MAX
> size limit is clearly an issue there...

Given my above comments, I think you are approaching this problem
the wrong way. It is known that the data set can exceed physical
kernel memory size, hence it needs to be swappable. That way users
can extend the kernel memory capacity via swapfiles when
bcachefs.fsck needs more memory than the system has physical RAM.

This is a problem Darrick had to address for the XFS online repair
code - we've known for a long time that repair needs to hold a data
set larger than physical memory to complete successfully. Hence for
online repair we needed a mechanism that provided us with pageable
kernel memory. vmalloc() is not an option - it has hard size limits
(both API based and physical capacity based).

Hence Darrick designed and implemented pageable shmem backed memory
files (xfiles) to hold these data sets. Hence the size limit of the
online repair data set is physical RAM + swap space, same as it is
for offline repair. You can find the xfile code in
fs/xfs/scrub/xfile.[ch].

Support for large, sortable arrays of fixed size records built on
xfiles can be found in xfarray.[ch], and blob storage in
xfblob.[ch].

vmalloc() is really not a good solution for holding arbitrary sized
data sets in kernel memory....

-Dave.
-- 
Dave Chinner
david@fromorbit.com