Date: Tue, 4 Jun 2024 12:53:19 -0400
From: Josef Bacik
To: Bernd Schubert
Cc: Miklos Szeredi, Jingbo Xu, "linux-fsdevel@vger.kernel.org",
 "linux-kernel@vger.kernel.org", lege.wang@jaguarmicro.com,
 "Matthew Wilcox (Oracle)", "linux-mm@kvack.org"
Subject: Re: [HELP] FUSE writeback performance bottleneck
Message-ID: <20240604165319.GG3413@localhost.localdomain>
References: <495d2400-1d96-4924-99d3-8b2952e05fc3@linux.alibaba.com>
 <67771830-977f-4fca-9d0b-0126abf120a5@fastmail.fm>
 <2f834b5c-d591-43c5-86ba-18509d77a865@fastmail.fm>
 <21741978-a604-4054-8af9-793085925c82@fastmail.fm>
In-Reply-To: <21741978-a604-4054-8af9-793085925c82@fastmail.fm>

On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
>
>
> On 6/4/24 12:02, Miklos Szeredi wrote:
> > On Tue, 4 Jun 2024 at 11:32, Bernd Schubert wrote:
> >
> >> Back to the background for the copy, so it copies pages to avoid
> >> blocking on memory reclaim.
> >> With that allocation it in fact increases
> >> memory pressure even more. Isn't the right solution to mark those pages
> >> as not reclaimable and to avoid blocking on it? Which is what the tmp
> >> pages do, just not in beautiful way.
> >
> > Copying to the tmp page is the same as marking the pages as
> > non-reclaimable and non-syncable.
> >
> > Conceptually it would be nice to only copy when there's something
> > actually waiting for writeback on the page.
> >
> > Note: normally the WRITE request would be copied to userspace along
> > with the contents of the pages very soon after starting writeback.
> > After this the contents of the page no longer matter, and we can just
> > clear writeback without doing the copy.
> >
> > But if the request gets stuck in the input queue before being copied
> > to userspace, then deadlock can still happen if the server blocks on
> > direct reclaim and won't continue with processing the queue. And
> > sync(2) will also block in that case.
> >
> > So we'd somehow need to handle stuck WRITE requests. I don't see an
> > easy way to do this "on demand", when something actually starts
> > waiting on PG_writeback. Alternatively the page copy could be done
> > after a timeout, which is ugly, but much easier to implement.
>
> I think the timeout method would only work if we have already allocated
> the pages, under memory pressure page allocation might not work well.
> But then this still seems to be a workaround, because we don't take any
> less memory with these copied pages.
> I'm going to look into mm/ if there isn't a better solution.

I've thought a bit about this, and I still don't have a good solution,
so I'm going to throw out my random thoughts and see if it helps us get
to a good spot.

1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
memalloc_*_save/memalloc_*_restore, so the process itself is marked as
being in these contexts.
We could do something similar for FUSE, though this gets hairy with
things that async off request handling to other threads (which is all of
the FUSE file systems we have internally). We'd need to have some way to
apply this to an entire process group, but this could be a workable
solution.

2. Per-request timeouts. This is something we're planning on tackling
for other reasons, but it could fit nicely here to say "if this fuse fs
has a per-request timeout, skip the copy". That way we at least have an
upper bound on how long we would be "deadlocked". I don't love this
approach because it's still a deadlock until the timeout elapses, but
it's an idea.

3. Since we're limiting writeout per the BDI, we could just say FUSE is
special, only one memory reclaim related writeout at a time. We flag
when we're doing a write via memory reclaim, and then if we try to
trigger writeout via memory reclaim again we simply reject it to avoid
the deadlock. This has the downside that non-FUSE things triggering
direct reclaim through FUSE will end up reclaiming something else, and
if the dirty pages from FUSE are the ones causing the problem we could
spin a bunch evicting pages that we don't care about and thrash a bit.

As I said, all of these have downsides. I think #1 is probably the most
workable, but I haven't thought about it super thoroughly. Thanks,

Josef
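
To make #1 a bit more concrete, here is a rough userspace model of the
scoped save/restore pattern. It is only a sketch: every name starting
with model_ or MODEL_ is made up for illustration and is not a kernel
API, and the thread-local variable merely stands in for a per-task flags
word. It shows the shape memalloc_*_save/memalloc_*_restore have (save
returns the previous state, restore puts it back, so scopes nest), not
how FUSE would actually wire it up.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task flag: "this task is servicing FUSE requests". */
#define MODEL_PF_FUSE_SERVER 0x1u

/* Stands in for a per-task flags word; one copy per thread. */
static _Thread_local unsigned int model_task_flags;

/* Enter the scope. Returns the previous state of the bit so that
 * nested save/restore pairs unwind correctly. */
static unsigned int model_fuse_server_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_FUSE_SERVER;

	model_task_flags |= MODEL_PF_FUSE_SERVER;
	return old;
}

/* Leave the scope, restoring whatever the caller saw on entry. */
static void model_fuse_server_restore(unsigned int old)
{
	/* Only the one bit we manage may be passed back in. */
	assert((old & ~MODEL_PF_FUSE_SERVER) == 0);
	model_task_flags = (model_task_flags & ~MODEL_PF_FUSE_SERVER) | old;
}

/* Reclaim-side check in this model: a task inside the scope must not
 * block waiting on FUSE writeback, since it may be the one that has to
 * complete that writeback. */
static bool model_may_wait_on_fuse_writeback(void)
{
	return !(model_task_flags & MODEL_PF_FUSE_SERVER);
}
```

Because the flag lives in per-thread state, a worker thread that the
server hands a request off to would not inherit it, which is exactly the
"async off request handling to other threads" problem mentioned in #1.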