Date: Wed, 5 Jun 2024 11:35:52 -0400
From: Josef Bacik
To: Amir Goldstein
Cc: Bernd Schubert, Miklos Szeredi, Jingbo Xu,
 "linux-fsdevel@vger.kernel.org", "linux-kernel@vger.kernel.org",
 lege.wang@jaguarmicro.com, "Matthew Wilcox (Oracle)",
 "linux-mm@kvack.org"
Subject: Re: [HELP] FUSE writeback performance bottleneck
Message-ID: <20240605153552.GB21567@localhost.localdomain>
References: <2f834b5c-d591-43c5-86ba-18509d77a865@fastmail.fm>
 <21741978-a604-4054-8af9-793085925c82@fastmail.fm>
 <20240604165319.GG3413@localhost.localdomain>
 <6853a389-031b-4bd6-a300-dea878979d8c@fastmail.fm>
 <20240604221654.GA17503@localhost.localdomain>

On Wed, Jun 05, 2024 at 08:49:48AM +0300, Amir Goldstein wrote:
> On Wed, Jun 5, 2024 at 1:17 AM Josef Bacik wrote:
> >
> > On Tue, Jun 04, 2024 at 11:39:17PM +0200, Bernd Schubert wrote:
> > >
> > >
> > > On 6/4/24 18:53, Josef Bacik wrote:
> > > > On Tue, Jun 04, 2024 at 04:13:25PM +0200, Bernd Schubert wrote:
> > > >>
> > > >>
> > > >> On 6/4/24 12:02, Miklos Szeredi wrote:
> > > >>> On Tue, 4 Jun 2024 at 11:32, Bernd Schubert wrote:
> > > >>>
> > > >>>> Back to the background for the copy, so it copies pages to avoid
> > > >>>> blocking on memory reclaim. With that allocation it in fact increases
> > > >>>> memory pressure even more. Isn't the right solution to mark those pages
> > > >>>> as not reclaimable and to avoid blocking on it? Which is what the tmp
> > > >>>> pages do, just not in beautiful way.
> > > >>>
> > > >>> Copying to the tmp page is the same as marking the pages as
> > > >>> non-reclaimable and non-syncable.
> > > >>>
> > > >>> Conceptually it would be nice to only copy when there's something
> > > >>> actually waiting for writeback on the page.
> > > >>>
> > > >>> Note: normally the WRITE request would be copied to userspace along
> > > >>> with the contents of the pages very soon after starting writeback.
> > > >>> After this the contents of the page no longer matter, and we can just
> > > >>> clear writeback without doing the copy.
> > > >>>
> > > >>> But if the request gets stuck in the input queue before being copied
> > > >>> to userspace, then deadlock can still happen if the server blocks on
> > > >>> direct reclaim and won't continue with processing the queue. And
> > > >>> sync(2) will also block in that case.
> > > >>>
> > > >>> So we'd somehow need to handle stuck WRITE requests. I don't see an
> > > >>> easy way to do this "on demand", when something actually starts
> > > >>> waiting on PG_writeback. Alternatively the page copy could be done
> > > >>> after a timeout, which is ugly, but much easier to implement.
> > > >>
> > > >> I think the timeout method would only work if we have already allocated
> > > >> the pages, under memory pressure page allocation might not work well.
> > > >> But then this still seems to be a workaround, because we don't take any
> > > >> less memory with these copied pages.
> > > >> I'm going to look into mm/ if there isn't a better solution.
> > > >
> > > > I've thought a bit about this, and I still don't have a good solution, so I'm
> > > > going to throw out my random thoughts and see if it helps us get to a good spot.
> > > >
> > > > 1. Generally we are moving away from GFP_NOFS/GFP_NOIO to instead use
> > > > memalloc_*_save/memalloc_*_restore, so instead the process is marked being in
> > > > these contexts. We could do something similar for FUSE, tho this gets hairy
> > > > with things that async off request handling to other threads (which is all of
> > > > the FUSE file systems we have internally). We'd need to have some way to
> > > > apply this to an entire process group, but this could be a workable solution.
> > > >
> > > >
> > > I'm not sure how either of both (GFP_ and memalloc_) would work for
> > > userspace allocations.
> > > Wouldn't we basically need to have a feature to disable memory
> > > allocations for fuse userspace tasks? Hmm, maybe through mem_cgroup.
> > > Although even then, the file system might depend on other kernel
> > > resources (backend file system or block device or even network) that
> > > might do allocations on their own without the knowledge of the fuse server.
> > >
> >
> > Basically that only in the case that we're handling a request from memory
> > pressure we would invoke this, and then any allocation would automatically have
> > gfp_nofs protection because it's flagged at the task level.
> >
> > Again there's a lot of problems with this, like how do we set it for the task,
> > how does it work for threads etc.
> >
> > > > 2. Per-request timeouts. This is something we're planning on tackling for other
> > > > reasons, but it could fit nicely here to say "if this fuse fs has a
> > > > per-request timeout, skip the copy". That way we at least know we're upper
> > > > bound on how long we would be "deadlocked". I don't love this approach
> > > > because it's still a deadlock until the timeout elapsed, but it's an idea.
> > >
> > > Hmm, how do we know "this fuse fs has a per-request timeout"? I don't
> > > think we could trust initialization flags set by userspace.
> > >
> >
> > It would be controlled by the kernel. So at init time the fuse file system says
> > "my command timeout is 30 minutes." Then the kernel enforces this by having a
> > per-request timeout, and once that 30 minutes elapses we cancel the request and
> > EIO it. User space doesn't do anything beyond telling the kernel what its
> > timeout is, so this would be safe.
> >
>
> Maybe that would be better to configure by mounter, similar to nfs -otimeo
> and maybe consider opt-in to returning ETIMEDOUT in this case.
> At least nfsd will pass that error to nfs client and nfs client will retry.
>
> Different applications (or network protocols) handle timeouts differently,
> so the timeout and error seems like a decision for the admin/mounter not
> for the fuse server, although there may be a fuse fs that would want to
> set the default timeout, as if to request the kernel to be its watchdog
> (i.e. do not expect me to take more than 30 min to handle any request).

Oh yeah for sure, I'm just saying that for the purposes of allowing the FUSE
daemon to be a little riskier with system resources, we base it off of whether
it opts in to command timeouts. My plan is for the timeout to be settable by
the fuse daemon, or externally by a sysadmin via sysfs. Thanks,

Josef
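
For readers following the thread, the per-request timeout idea can be
illustrated with a toy userspace model. This is only a sketch under stated
assumptions, not kernel code and not the planned implementation: the names
Connection, Request, daemon_timeout, admin_timeout and may_skip_tmp_copy are
hypothetical, and the rule that an admin-set value overrides the daemon's
value is an assumption (the thread only says either party may set it). The
choice between EIO and ETIMEDOUT is left as a configurable field because the
thread treats it as an open question.

```python
import errno
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Request:
    """A pending request and the time (in seconds) at which it was queued."""
    unique: int
    queued_at: float
    error: Optional[int] = None  # errno recorded when the request is cancelled


@dataclass
class Connection:
    """Hypothetical per-mount state; these are not real kernel fields."""
    daemon_timeout: Optional[float] = None  # seconds, requested by the daemon at init
    admin_timeout: Optional[float] = None   # seconds, set externally by a sysadmin
    timeout_errno: int = errno.EIO          # ETIMEDOUT would be an opt-in alternative
    pending: Dict[int, Request] = field(default_factory=dict)

    def effective_timeout(self) -> Optional[float]:
        # Assumption: a value set by the admin takes precedence over the daemon's.
        if self.admin_timeout is not None:
            return self.admin_timeout
        return self.daemon_timeout

    def may_skip_tmp_copy(self) -> bool:
        # Take the riskier path only when any stall is bounded by a timeout.
        return self.effective_timeout() is not None

    def queue(self, unique: int, now: float) -> Request:
        req = Request(unique, now)
        self.pending[unique] = req
        return req

    def expire(self, now: float) -> List[Request]:
        """Cancel every pending request that has reached the effective timeout."""
        timeout = self.effective_timeout()
        if timeout is None:
            return []
        expired = [r for r in self.pending.values() if now - r.queued_at >= timeout]
        for req in expired:
            req.error = self.timeout_errno
            del self.pending[req.unique]
        return expired
```

In this model a connection created with daemon_timeout=30 * 60 reports
may_skip_tmp_copy() as true, and calling expire() with the current time
cancels any request queued at least 30 minutes earlier with the configured
errno. A connection with no timeout configured keeps the existing
copy-to-tmp-page behaviour.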