From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 14 May 2024 13:48:52 -0700
From: Shakeel Butt
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Rik van Riel,
 linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 kernel-team@fb.com
Subject: Re: [PATCH] mm: vmscan: restore incremental cgroup iteration
Message-ID:
References: <20240514202641.2821494-1-hannes@cmpxchg.org>
In-Reply-To: <20240514202641.2821494-1-hannes@cmpxchg.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Tue, May 14, 2024 at 04:26:41PM -0400, Johannes Weiner wrote:
> Currently, reclaim always walks the entire cgroup tree in order to
> ensure fairness between groups. While overreclaim is limited in
> shrink_lruvec(), many of our systems have a sizable number of active
> groups, and an even bigger number of idle cgroups with cache left
> behind by previous jobs; the mere act of walking all these cgroups can
> impose significant latency on direct reclaimers.
>
> In the past, we've used a save-and-restore iterator that enabled
> incremental tree walks over multiple reclaim invocations. This ensured
> fairness, while keeping the work of individual reclaimers small.
>
> However, in edge cases with a lot of reclaim concurrency, individual
> reclaimers would sometimes not see enough of the cgroup tree to make
> forward progress and (prematurely) declare OOM. Consequently we
> switched to comprehensive walks in 1ba6fc9af35b ("mm: vmscan: do not
> share cgroup iteration between reclaimers").
>
> To address the latency problem without bringing back the premature OOM
> issue, reinstate the shared iteration, but with a restart condition to
> do the full walk in the OOM case - similar to what we do for
> memory.low enforcement and active page protection.
>
> In the worst case, we do one more full tree walk before declaring
> OOM.
> But the vast majority of direct reclaim scans can then finish
> much quicker, while fairness across the tree is maintained:
>
> - Before this patch, we observed that direct reclaim always takes more
>   than 100us and most direct reclaim time is spent in reclaim cycles
>   lasting between 1ms and 1 second. Almost 40% of direct reclaim time
>   was spent on reclaim cycles exceeding 100ms.
>
> - With this patch, almost all page reclaim cycles last less than 10ms,
>   and a good amount of direct page reclaim finishes in under 100us. No
>   page reclaim cycles lasting over 100ms were observed anymore.
>
> The shared iterator state is maintained inside the target cgroup, so
> fair and incremental walks are performed during both global reclaim
> and cgroup limit reclaim of complex subtrees.
>
> Reported-by: Rik van Riel
> Signed-off-by: Johannes Weiner
> Signed-off-by: Rik van Riel

Reviewed-by: Shakeel Butt
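
For anyone skimming the archive, here is a small user-space model of the
scheme described in the quoted changelog: reclaimers share a saved iterator
position and each invocation only scans a slice of the group list, falling
back to one full walk before giving up. This is an illustrative sketch only;
the group list, slice size, and helper names (shrink_one(), try_reclaim(),
shared_pos) are made-up stand-ins and not the actual mm/vmscan.c interfaces
or the patch itself.

/*
 * Toy model of shared incremental iteration with a full-walk fallback
 * before declaring OOM. Not kernel code; all names and numbers are
 * illustrative assumptions.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_GROUPS	8	/* "cgroup tree" flattened to a list */
#define SLICE		3	/* groups scanned per incremental pass */

static int shared_pos;			/* saved position shared by reclaimers */
static int pages[NR_GROUPS] = { 0, 0, 5, 0, 0, 0, 9, 0 };

/* Try to reclaim from one group; returns pages freed. */
static int shrink_one(int idx)
{
	int freed = pages[idx];

	pages[idx] = 0;
	return freed;
}

/* One reclaim invocation: incremental slice first, full walk as fallback. */
static bool try_reclaim(void)
{
	int freed = 0;
	int i;

	/* Incremental pass: resume where the previous reclaimer stopped. */
	for (i = 0; i < SLICE; i++) {
		freed += shrink_one(shared_pos);
		shared_pos = (shared_pos + 1) % NR_GROUPS;
	}
	if (freed)
		return true;

	/* Restart condition: one full walk before giving up (OOM). */
	for (i = 0; i < NR_GROUPS; i++)
		freed += shrink_one(i);

	return freed > 0;
}

int main(void)
{
	int pass;

	for (pass = 1; pass <= 4; pass++)
		printf("pass %d: %s\n", pass,
		       try_reclaim() ? "made progress" : "would declare OOM");
	return 0;
}

Running it shows incremental passes making progress while the shared position
advances, and "would declare OOM" being printed only after a full walk finds
nothing left to reclaim, which is the behavior the changelog describes.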