From: Muchun Song <songmuchun@bytedance.com>
To: willy@infradead.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
 mhocko@kernel.org, vdavydov.dev@gmail.com, shakeelb@google.com,
 guro@fb.com, shy828301@gmail.com, alexs@kernel.org,
 richard.weiyang@gmail.com, david@fromorbit.com,
 trond.myklebust@hammerspace.com, anna.schumaker@netapp.com
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, linux-nfs@vger.kernel.org,
 zhengqi.arch@bytedance.com, duanxiongchun@bytedance.com,
 fam.zheng@bytedance.com, Muchun Song <songmuchun@bytedance.com>
Subject: [PATCH v2 00/21] Optimize list lru memory consumption
Date: Thu, 27 May 2021 14:21:27 +0800
Message-Id: <20210527062148.9361-1-songmuchun@bytedance.com>

On one of our servers we found a suspected memory leak: the kmalloc-32 slab
cache consumes more than 6GB of memory, while every other kmem_cache
consumes less than 2GB. Our analysis showed that the kmalloc-32 usage is
caused by list_lru_one allocations.

  crash> p memcg_nr_cache_ids
  memcg_nr_cache_ids = $2 = 24574

memcg_nr_cache_ids is very large, and the memory consumed by each list_lru
can be calculated with the following formula:

  num_numa_node * memcg_nr_cache_ids * 32 (kmalloc-32)

There are 4 NUMA nodes in our system, so each list_lru consumes ~3MB.

  crash> list super_blocks | wc -l
  952

Every mount registers 2 list_lrus, one for inodes and one for dentries.
With 952 super_blocks the total memory is 952 * 2 * 3MB (~5.6GB). But the
number of memory cgroups currently on the machine is less than 500, so I
guess more than 12286 memory cgroups have been created and destroyed over
its lifetime (I do not know why there are so many; it may be a user bug, or
the user may really want to do that), while memcg_nr_cache_ids was never
reduced back to a suitable value. This wastes a lot of memory, and the only
way to shrink memcg_nr_cache_ids today is to reboot the server, which is
not what we want.

I had posted a patchset [1] to reduce memcg_nr_cache_ids, but it did not
fundamentally solve the problem: we currently allocate scope for every
memcg to be tracked on every superblock instantiated in the system,
regardless of whether that superblock is even accessible to that memcg.

These huge memcg counts come from container hosts where each memcg is
confined to just a small subset of the superblocks instantiated at any
given point in time.
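
To sanity-check the arithmetic above, here is a minimal userspace C sketch
(not kernel code; the constants are simply the values observed on this
server, as reported by the crash sessions above):

  #include <stdio.h>

  int main(void)
  {
          const long long nr_numa_nodes   = 4;     /* NUMA nodes on the server */
          const long long nr_cache_ids    = 24574; /* memcg_nr_cache_ids from crash */
          const long long obj_size        = 32;    /* list_lru_one comes from kmalloc-32 */
          const long long nr_super_blocks = 952;   /* crash> list super_blocks | wc -l */
          const long long lrus_per_sb     = 2;     /* one inode lru + one dentry lru */

          long long per_lru = nr_numa_nodes * nr_cache_ids * obj_size;
          long long total   = nr_super_blocks * lrus_per_sb * per_lru;

          /* prints ~3.0 MB per list_lru and ~5.6 GB in total */
          printf("per list_lru: %.1f MB\n", per_lru / (1024.0 * 1024.0));
          printf("total:        %.1f GB\n", total / (1024.0 * 1024.0 * 1024.0));
          return 0;
  }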
For these systems with huge container counts, list_lru does not need the
capability of tracking every memcg on every superblock. What it comes down
to is that a list_lru is only needed for a given memcg if that memcg is
instantiating and freeing objects on that list_lru. As Dave said, "Which
makes me think we should be moving more towards 'add the memcg to the
list_lru at the first insert' model rather than 'instantiate all at memcg
init time just in case'."

This patchset optimizes the list_lru memory consumption from different
aspects:

  Patches 1-6 are code simplifications.
  Patch 7 converts the array from per-memcg per-node to per-memcg.
  Patch 8 is a code simplification.
  Patches 9-15 make the list_lru_one allocation dynamic.
  Patch 16 is a code cleanup.
  Patch 17 uses an xarray to optimize the size of the per-memcg pointer array.
  Patches 18-21 are code simplifications.

I did a simple test to show the optimization: create 10k memory cgroups and
mount 10k filesystems, then use the free command to show how much memory
the system consumes after this operation (there are 2 NUMA nodes in the
system).

  +-----------------------+------------------------+
  | condition             | memory consumption     |
  +-----------------------+------------------------+
  | without this patchset | 24464 MB               |
  +-----------------------+------------------------+
  | after patch 7         | 21957 MB               | <-- the more NUMA nodes,
  +-----------------------+------------------------+     the more obvious the
  | after patch 15        | 6895 MB                |     effect
  +-----------------------+------------------------+
  | after patch 17        | 4367 MB                |
  +-----------------------+------------------------+

BTW, there was a recent discussion [2] on the same issue.

[1] https://lore.kernel.org/linux-fsdevel/20210428094949.43579-1-songmuchun@bytedance.com/
[2] https://lore.kernel.org/linux-fsdevel/20210405054848.GA1077931@in.ibm.com/

Changelog in v2:
  1. Update Documentation/filesystems/porting.rst, as suggested by Dave.
  2. Add a comment above alloc_inode_sb(), as suggested by Dave.
  3. Rework some patches' commit logs.
  4. Add patches 18-21.

Thanks Dave.
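
For reference, a rough sketch of the "add the memcg to the list_lru at the
first insert" model the series moves towards. alloc_inode_sb() (patch 10)
passes the superblock's inode LRU down to the slab allocator via
kmem_cache_alloc_lru() (patch 9); the function bodies and the
memcg_list_lru_alloc() helper name below are simplified assumptions based
on this cover letter, not the actual patch contents:

  /*
   * Sketch only -- simplified from the cover letter, not the real patches.
   *
   * Filesystems allocate inodes through alloc_inode_sb() so that the slab
   * allocator knows which list_lru the object may later be added to:
   */
  static inline void *alloc_inode_sb(struct super_block *sb,
                                     struct kmem_cache *cache, gfp_t gfp)
  {
          return kmem_cache_alloc_lru(cache, &sb->s_inode_lru, gfp);
  }

  void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
                             gfp_t gfp)
  {
          /*
           * Make sure the current memcg has a list_lru_one in @lru before
           * the object exists, instead of preallocating one for every memcg
           * ID on every list_lru at init time. memcg_list_lru_alloc() is a
           * hypothetical name for that "allocate on first use" step.
           */
          if (lru && memcg_kmem_enabled())
                  memcg_list_lru_alloc(lru, gfp);

          return kmem_cache_alloc(s, gfp);
  }

A filesystem conversion is then a one-liner (ext2 shown as a hypothetical
example), which is why most entries in the diffstat below are "2 +-":

  -       ei = kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL);
  +       ei = alloc_inode_sb(sb, ext2_inode_cachep, GFP_KERNEL);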
Muchun Song (21):
  mm: list_lru: fix list_lru_count_one() return value
  mm: memcontrol: remove kmemcg_id reparenting
  mm: memcontrol: remove the kmem states
  mm: memcontrol: do it in mem_cgroup_css_online to make the kmem online
  mm: list_lru: remove lru node locking from memcg_update_list_lru_node
  mm: list_lru: only add the memcg aware lrus to the list_lrus
  mm: list_lru: optimize the array of per memcg lists memory consumption
  mm: list_lru: remove memcg_aware field from struct list_lru
  mm: introduce kmem_cache_alloc_lru
  fs: introduce alloc_inode_sb() to allocate filesystems specific inode
  mm: dcache: use kmem_cache_alloc_lru() to allocate dentry
  xarray: use kmem_cache_alloc_lru to allocate xa_node
  mm: workingset: use xas_set_lru() to pass shadow_nodes
  nfs42: use a specific kmem_cache to allocate nfs4_xattr_entry
  mm: list_lru: allocate list_lru_one only when needed
  mm: list_lru: rename memcg_drain_all_list_lrus to memcg_reparent_list_lrus
  mm: list_lru: replace linear array with xarray
  mm: memcontrol: reuse memory cgroup ID for kmem ID
  mm: memcontrol: fix cannot alloc the maximum memcg ID
  mm: list_lru: rename list_lru_per_memcg to list_lru_memcg
  mm: memcontrol: rename memcg_cache_id to memcg_kmem_id

 Documentation/filesystems/porting.rst |   5 +
 drivers/dax/super.c                   |   2 +-
 fs/9p/vfs_inode.c                     |   2 +-
 fs/adfs/super.c                       |   2 +-
 fs/affs/super.c                       |   2 +-
 fs/afs/super.c                        |   2 +-
 fs/befs/linuxvfs.c                    |   2 +-
 fs/bfs/inode.c                        |   2 +-
 fs/block_dev.c                        |   2 +-
 fs/btrfs/inode.c                      |   2 +-
 fs/ceph/inode.c                       |   2 +-
 fs/cifs/cifsfs.c                      |   2 +-
 fs/coda/inode.c                       |   2 +-
 fs/dcache.c                           |   3 +-
 fs/ecryptfs/super.c                   |   2 +-
 fs/efs/super.c                        |   2 +-
 fs/erofs/super.c                      |   2 +-
 fs/exfat/super.c                      |   2 +-
 fs/ext2/super.c                       |   2 +-
 fs/ext4/super.c                       |   2 +-
 fs/f2fs/super.c                       |   2 +-
 fs/fat/inode.c                        |   2 +-
 fs/freevxfs/vxfs_super.c              |   2 +-
 fs/fuse/inode.c                       |   2 +-
 fs/gfs2/super.c                       |   2 +-
 fs/hfs/super.c                        |   2 +-
 fs/hfsplus/super.c                    |   2 +-
 fs/hostfs/hostfs_kern.c               |   2 +-
 fs/hpfs/super.c                       |   2 +-
 fs/hugetlbfs/inode.c                  |   2 +-
 fs/inode.c                            |   2 +-
 fs/isofs/inode.c                      |   2 +-
 fs/jffs2/super.c                      |   2 +-
 fs/jfs/super.c                        |   2 +-
 fs/minix/inode.c                      |   2 +-
 fs/nfs/inode.c                        |   2 +-
 fs/nfs/nfs42xattr.c                   |  95 ++++----
 fs/nilfs2/super.c                     |   2 +-
 fs/ntfs/inode.c                       |   2 +-
 fs/ocfs2/dlmfs/dlmfs.c                |   2 +-
 fs/ocfs2/super.c                      |   2 +-
 fs/openpromfs/inode.c                 |   2 +-
 fs/orangefs/super.c                   |   2 +-
 fs/overlayfs/super.c                  |   2 +-
 fs/proc/inode.c                       |   2 +-
 fs/qnx4/inode.c                       |   2 +-
 fs/qnx6/inode.c                       |   2 +-
 fs/reiserfs/super.c                   |   2 +-
 fs/romfs/super.c                      |   2 +-
 fs/squashfs/super.c                   |   2 +-
 fs/sysv/inode.c                       |   2 +-
 fs/ubifs/super.c                      |   2 +-
 fs/udf/super.c                        |   2 +-
 fs/ufs/super.c                        |   2 +-
 fs/vboxsf/super.c                     |   2 +-
 fs/xfs/xfs_icache.c                   |   3 +-
 fs/zonefs/super.c                     |   2 +-
 include/linux/fs.h                    |  11 +
 include/linux/list_lru.h              |  18 +-
 include/linux/memcontrol.h            |  48 ++--
 include/linux/slab.h                  |   4 +
 include/linux/swap.h                  |   5 +-
 include/linux/xarray.h                |   9 +-
 ipc/mqueue.c                          |   2 +-
 lib/xarray.c                          |  10 +-
 mm/list_lru.c                         | 447 +++++++++++++++++-----------------
 mm/memcontrol.c                       | 185 ++------------
 mm/shmem.c                            |   2 +-
 mm/slab.c                             |  39 ++-
 mm/slab.h                             |  17 +-
 mm/slub.c                             |  42 ++--
 mm/workingset.c                       |   2 +-
 net/socket.c                          |   2 +-
 net/sunrpc/rpc_pipe.c                 |   2 +-
 74 files changed, 480 insertions(+), 577 deletions(-)

--
2.11.0