From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 4 Oct 2022 18:17:40 -0700
Subject: [RFC] memcg rstat flushing optimization
To: Tejun Heo, Zefan Li, Johannes Weiner, Michal Hocko, Shakeel Butt, Roman Gushchin, Michal Koutný
Cc: Andrew Morton, Linux-MM, Cgroups, Greg Thelen
Hey everyone,

Sorry for the long email :)

We have recently run into a hard lockup on a machine with hundreds of CPUs and thousands of memcgs during an rstat flush. There have also been some discussions during LPC between myself, Michal Koutný, and Shakeel about memcg rstat flushing optimization. This email is a follow-up on that, discussing possible ideas to optimize memcg rstat flushing.

Currently, mem_cgroup_flush_stats() is the main interface for flushing memcg stats. It has some internal optimizations that skip a flush when there haven't been significant updates. It always flushes the entire memcg hierarchy, and it always flushes via cgroup_rstat_flush_irqsafe(), which runs with interrupts disabled and does not sleep.
As you can imagine, with a sufficiently large number of memcgs and CPUs, a call to mem_cgroup_flush_stats() can be slow, or in an extreme case like the one we ran into, cause a hard lockup (despite the periodic flush that runs every 4 seconds).

(a) A first step might be to introduce a non-_irqsafe version of mem_cgroup_flush_stats(), and only call the _irqsafe version in places where we can't sleep. This excludes some contexts, like the stats-reading context and the periodic flushing context, from being able to trigger a lockup.

(b) We can also stop flushing the entire memcg hierarchy, in the hope that flushing then happens incrementally over subtrees. However, full-hierarchy flushing was introduced to reduce lock contention when multiple contexts try to flush memcg stats concurrently: only one of them flushes and all the others return immediately (at some cost in accuracy, since the others don't actually wait for the flush to complete). Flushing subtrees would re-introduce that lock contention. Maybe we can mitigate this in the rstat code by using hierarchical locks instead of a global lock, although I can imagine this quickly getting too complicated.

(c) One other thing we can do (similar to the recent blkcg patch series [1]) is keep track of which stats have actually been updated. We currently flush MEMCG_NR_STATS + MEMCG_NR_EVENTS (thanks to Shakeel) + nodes * NR_VM_NODE_STAT_ITEMS. I didn't do the exact calculation, but I suspect this easily goes over 100. The tracking could take the form of a per-cpu bitmask. It would add some overhead on both the update and flush sides, but it could let us skip a lot of up-to-date stats and the cache misses that come with touching them. On a few sample machines I found that each (memcg, cpu) pair had fewer than 5 updated stats on average.

(d) Instead of optimizing rstat flushing in general, we can just mitigate the cases that can actually cause a lockup.
After we do (a) and separate out the call sites that actually need interrupts disabled, we can introduce a new selective-flush callback (e.g. cgroup_rstat_flush_opts()). This callback would flush only the stats we care about (a bitmask?) and leave the rstat tree untouched (traverse the tree, but don't pop the nodes). It would be less than optimal in cases where the stats we choose to flush happen to be the only ones that were updated, since the cgroup then remains on the rstat tree for no reason. However, it effectively addresses the lockup cases by flushing only a small subset of the stats.

(e) If we do both (c) and (d), we can go one step further. We can make cgroup_rstat_flush_opts() return a boolean indicating whether the cgroup is now completely flushed (i.e. what we asked to flush covers everything that was updated). If true, we can remove the cgroup from the rstat tree. To do this, though, we would need either separate rstat trees per subsystem, or per-cgroup tracking of which subsystems have updates (so that when cgroup_rstat_flush_opts() returns true, we know whether the cgroup can come off the tree).

Of course, nothing is free. Most of the ideas above introduce overhead somewhere, add complexity, or both. We also don't have a de facto benchmark that can tell us for sure whether a change makes things generally better, since the results will vary widely with the setup, the workloads, etc. Nothing will make everything better for all use cases. This is just me kicking off a discussion to see what we can/should do :)

[1] https://lore.kernel.org/lkml/20221004151748.293388-1-longman@redhat.com/