Subject: Re: Minimum inode cache size? (was: Slow file operations on file server with 30 TB hardware RAID and 100 TB software RAID)
To: Paul Menzel, LKML
Cc: it+linux-xfs@molgen.mpg.de, linux-fsdevel@vger.kernel.org, linux-xfs@vger.kernel.org, linux-mm@kvack.org
From: Donald Buczek
Date: Thu, 26 Aug 2021 18:49:48 +0200
Message-ID: <878157e2-b065-aaee-f26b-5c87e9ddc2d6@molgen.mpg.de>

On 26.08.21 12:41, Paul Menzel wrote:
> Dear Linux folks,
>
>
> On 20.08.21 at 16:39, Paul Menzel wrote:
>
>> On 20.08.21 at 16:31, Paul Menzel wrote:
>>
>>> Short problem statement: Sometimes changing into a directory on a file server with 30 TB hardware RAID and 100 TB software RAID both formatted with
>>> XFS takes several seconds.
>>>
>>>
>>> On a Dell PowerEdge T630 with two Xeon CPU E5-2603 v4 @ 1.70GHz and 96 GB RAM, a 30 TB hardware RAID is served by the hardware RAID controller and a 100 TB MDRAID software RAID is connected to a Microchip 1100-8e, both formatted using XFS. Currently, Linux 5.4.39 runs on it.
>>>
>>> ```
>>> $ more /proc/version
>>> Linux version 5.4.39.mx64.334 (root@lol.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Thu May 7 14:27:50 CEST 2020
>>> $ dmesg | grep megar
>>> [   10.322823] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
>>> [   10.331910] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
>>> [   10.345055] megaraid_sas 0000:03:00.0: BAR:0x1  BAR's base_addr(phys):0x0000000092100000  mapped virt_addr:0x0000000059ea5995
>>> [   10.345057] megaraid_sas 0000:03:00.0: FW now in Ready state
>>> [   10.351868] megaraid_sas 0000:03:00.0: 63 bit DMA mask and 32 bit consistent mask
>>> [   10.361655] megaraid_sas 0000:03:00.0: firmware supports msix    : (96)
>>> [   10.369433] megaraid_sas 0000:03:00.0: requested/available msix 13/13
>>> [   10.377113] megaraid_sas 0000:03:00.0: current msix/online cpus    : (13/12)
>>> [   10.385190] megaraid_sas 0000:03:00.0: RDPQ mode    : (disabled)
>>> [   10.392092] megaraid_sas 0000:03:00.0: Current firmware supports maximum commands: 928     LDIO threshold: 0
>>> [   10.403895] megaraid_sas 0000:03:00.0: Configured max firmware commands: 927
>>> [   10.416840] megaraid_sas 0000:03:00.0: Performance mode :Latency
>>> [   10.424029] megaraid_sas 0000:03:00.0: FW supports sync cache    : No
>>> [   10.431417] megaraid_sas 0000:03:00.0: megasas_disable_intr_fusion is called outbound_intr_mask:0x40000009
>>> [   10.486158] megaraid_sas 0000:03:00.0: FW provided supportMaxExtLDs: 1    max_lds: 64
>>> [   10.495502] megaraid_sas 0000:03:00.0: controller type    : MR(2048MB)
>>> [   10.502988] megaraid_sas 0000:03:00.0: Online Controller Reset(OCR)    : Enabled
>>> [   10.511445] megaraid_sas 0000:03:00.0: Secure JBOD support    : No
>>> [   10.518543] megaraid_sas 0000:03:00.0: NVMe passthru support    : No
>>> [   10.525834] megaraid_sas 0000:03:00.0: FW provided TM TaskAbort/Reset timeout: 0 secs/0 secs
>>> [   10.536251] megaraid_sas 0000:03:00.0: JBOD sequence map support    : No
>>> [   10.543931] megaraid_sas 0000:03:00.0: PCI Lane Margining support    : No
>>> [   10.574406] megaraid_sas 0000:03:00.0: megasas_enable_intr_fusion is called outbound_intr_mask:0x40000000
>>> [   10.585995] megaraid_sas 0000:03:00.0: INIT adapter done
>>> [   10.592409] megaraid_sas 0000:03:00.0: JBOD sequence map is disabled megasas_setup_jbod_map 5660
>>> [   10.603273] megaraid_sas 0000:03:00.0: pci id        : (0x1000)/(0x005d)/(0x1028)/(0x1f42)
>>> [   10.612815] megaraid_sas 0000:03:00.0: unevenspan support    : yes
>>> [   10.619919] megaraid_sas 0000:03:00.0: firmware crash dump    : no
>>> [   10.627013] megaraid_sas 0000:03:00.0: JBOD sequence map    : disabled
>>> $ dmesg | grep 1100-8e
>>> [   25.853170] smartpqi 0000:84:00.0: added 11:2:0:0 0000000000000000 RAID              Adaptec  1100-8e
>>> [   25.867069] scsi 11:2:0:0: RAID              Adaptec  1100-8e  2.93 PQ: 0 ANSI: 5
>>> $ xfs_info /dev/sdc
>>> meta-data=/dev/sdc               isize=512    agcount=28, agsize=268435455 blks
>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>>>          =                       reflink=0
>>> data     =                       bsize=4096   blocks=7323648000, imaxpct=5
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> $ xfs_info /dev/md0
>>> meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
>>>          =                       sectsz=4096  attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=0, rmapbt=0
>>>          =                       reflink=0
>>> data     =                       bsize=4096   blocks=27348633088, imaxpct=1
>>>          =                       sunit=128    swidth=1792 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>> $ df -i /dev/sdc
>>> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
>>> /dev/sdc       2929459200 4985849 2924473351    1% /home/pmenzel
>>> $ df -i /dev/md0
>>> Filesystem         Inodes   IUsed      IFree IUse% Mounted on
>>> /dev/md0       2187890624 5331603 2182559021    1% /jbod/M8015
>>> ```
>>>
>>> After not using a directory for a while (over 24 hours), changing into it (locally) or doing some git operations takes over five seconds, for example in the Linux kernel source git tree located in my home directory. (My shell has some git integration showing the branch name in the prompt (`/usr/share/git-contrib/completion/git-prompt.sh`).) Once in that directory, everything reacts instantly again. While waiting, the Linux pressure stall information (PSI) shows IO resource contention.
>>>
>>> Before:
>>>
>>>      $ grep -R . /proc/pressure/
>>>      /proc/pressure/io:some avg10=0.40 avg60=0.10 avg300=0.10 total=48330841502
>>>      /proc/pressure/io:full avg10=0.40 avg60=0.10 avg300=0.10 total=48067233340
>>>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755842910
>>>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530206336
>>>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318140732
>>>
>>> During `git log stable/linux-5.10.y`:
>>>
>>>      $ grep -R . /proc/pressure/
>>>      /proc/pressure/io:some avg10=26.20 avg60=9.72 avg300=2.37 total=48337351849
>>>      /proc/pressure/io:full avg10=26.20 avg60=9.72 avg300=2.37 total=48073742033
>>>      /proc/pressure/cpu:some avg10=0.00 avg60=0.00 avg300=0.00 total=755843898
>>>      /proc/pressure/memory:some avg10=0.00 avg60=0.00 avg300=0.00 total=2530209046
>>>      /proc/pressure/memory:full avg10=0.00 avg60=0.00 avg300=0.00 total=2318143440
>>>
>>> The current explanation is that overnight several maintenance scripts, like backup/mirroring and accounting scripts, are run, which touch all files on the devices. Additionally, other users sometimes run cluster jobs with millions of files on the software RAID. Such things invalidate the inode cache, and “my” inodes are thrown out. When I use it afterward, it’s slow in the beginning. There is still free memory during these times according to `top`.
>>
>>      $ free -h
>>                    total        used        free      shared  buff/cache   available
>>      Mem:            94G        8.3G        5.3G        2.3M         80G         83G
>>      Swap:            0B          0B          0B
>>
>>> Does that sound reasonable with ten million inodes? Is that easily verifiable?
>>
>> If an inode consumes 512 bytes, with ten million inodes that would be around 500 MB, which should easily fit into the cache, so it does not need to be invalidated?
>
> Something is wrong with that calculation, and the cache size is much bigger.
>
> Looking into `/proc/slabinfo` and XFS’ runtime/internal statistics [1], it turns out that the inode cache is likely the problem.
>
> XFS’ internal stats show that only one third of the inode requests are answered from cache.
>
>     $ grep ^ig /sys/fs/xfs/stats/stats
>     ig 1791207386 647353522 20111 1143854223 394 1142080045 10683174
>
> During the problematic time, the SLAB size is around 4 GB and, according to slabinfo, the inode cache only has around 200,000 entries (sometimes even as low as 50,000).
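
For what it's worth, both figures can be rechecked with a few lines of shell. This is only a rough sketch: `inode_cache_mib` and `ig_hit_percent` are made-up helper names, the 960 bytes per inode is the `xfs_inode` object size that slabinfo reports on this machine, and it assumes that the first two numbers of the `ig` line are lookup attempts and cache hits, as described in [1].

```shell
#!/usr/bin/env bash
# Rough sanity checks for the numbers quoted above (hypothetical helpers).

# Memory footprint in MiB for <count> inodes of <bytes_per_inode> bytes each.
inode_cache_mib() {
    local count=$1 bytes_per_inode=$2
    echo $(( count * bytes_per_inode / 1024 / 1024 ))
}

# Cache hit rate in percent from an "ig" stats line on stdin.
# Assumption: the first number is lookup attempts, the second is cache hits.
ig_hit_percent() {
    awk '$1 == "ig" && $2 > 0 { printf "%.1f\n", 100 * $3 / $2 }'
}

inode_cache_mib 10000000 512    # on-disk inode size (isize=512)
inode_cache_mib 10000000 960    # in-memory xfs_inode object size
echo "ig 1791207386 647353522 20111 1143854223 394 1142080045 10683174" | ig_hit_percent
```

That gives about 4.8 GiB at 512 bytes per inode and about 8.9 GiB at 960 bytes, so well beyond 500 MB, and a hit rate of roughly 36 %, which fits the "one third" above.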
>
>     $ sudo grep inode /proc/slabinfo
>     nfs_inode_cache           16      24   1064    3    1 : tunables   24   12    8 : slabdata       8       8      0
>     rpc_inode_cache           94     138    640    6    1 : tunables   54   27    8 : slabdata      23      23      0
>     mqueue_inode_cache         1       4    896    4    1 : tunables   54   27    8 : slabdata       1       1      0
>     xfs_inode            1693683 1722284    960    4    1 : tunables   54   27    8 : slabdata  430571  430571      0
>     ext2_inode_cache           0       0    768    5    1 : tunables   54   27    8 : slabdata       0       0      0
>     reiser_inode_cache         0       0    760    5    1 : tunables   54   27    8 : slabdata       0       0      0
>     hugetlbfs_inode_cache      2      12    608    6    1 : tunables   54   27    8 : slabdata       2       2      0
>     sock_inode_cache         346     670    768    5    1 : tunables   54   27    8 : slabdata     134     134      0
>     proc_inode_cache         121     288    656    6    1 : tunables   54   27    8 : slabdata      48      48      0
>     shmem_inode_cache       2249    2827    696   11    2 : tunables   54   27    8 : slabdata     257     257      0
>     inode_cache           209098  209482    584    7    1 : tunables   54   27    8 : slabdata   29926   29926      0
>
> (What is the difference between `xfs_inode` and `inode_cache`?)
>
> Then going through all the files with `find -ls`, the inode cache grows to four to five million and the SLAB size grows to around 8 GB. Overnight it shrinks back to the numbers above and the page cache grows back.

Maybe this demonstrates what is probably happening:

==============================
#! /usr/bin/bash

cd /amd/claptrap/1/tmp || exit 1

# Create 5 x 1000 x 1000 = 5 million empty files (only on the first run).
if [ ! -d many-files ]; then
    mkdir -p many-files
    for i in $(seq -w 5); do
        mkdir many-files/$i
        for j in $(seq -w 1000); do
            mkdir -p many-files/$i/$j
            for k in $(seq -w 1000); do
                touch many-files/$i/$j/$k
            done
        done
    done
fi

# One file that is much bigger than RAM (600 GiB).
test -e big-file.dat || fallocate -l $((600*1024*1024*1024)) big-file.dat

# Start from empty page cache and slab caches.
echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null

echo "# Start:"
grep -E "^(MemTotal|MemFree|Cached|Active\(file\)|Inactive\(file\)|Slab):" /proc/meminfo
sudo grep xfs_inode /proc/slabinfo

# Fill the inode cache.
find many-files -ls > /dev/null

echo "# After walking many files :"
grep -E "^(MemTotal|MemFree|Cached|Active\(file\)|Inactive\(file\)|Slab):" /proc/meminfo
sudo grep xfs_inode /proc/slabinfo

# Push the big file through the page cache.
cat big-file.dat > /dev/null

echo "# After reading big file:"
grep -E "^(MemTotal|MemFree|Cached|Active\(file\)|Inactive\(file\)|Slab):" /proc/meminfo
sudo grep xfs_inode /proc/slabinfo
==============================

Output:

# Start:
MemTotal:       98634372 kB
MemFree:        97586092 kB
Cached:           115184 kB
Active(file):     100992 kB
Inactive(file):     8984 kB
Slab:             334300 kB
xfs_inode           1329    2272    960    4    1 : tunables   54   27    8 : slabdata     568     568    333
# After walking many files :
MemTotal:       98634372 kB
MemFree:        88795708 kB
Cached:           138024 kB
Active(file):     106740 kB
Inactive(file):    28176 kB
Slab:            6445960 kB
xfs_inode        5006003 5006008    960    4    1 : tunables   54   27    8 : slabdata 1251502 1251502      0
# After reading big file:
MemTotal:       98634372 kB
MemFree:          495240 kB
Cached:         95767564 kB
Active(file):     109404 kB
Inactive(file): 95655164 kB
Slab:            1693884 kB
xfs_inode          67714   68324    960    4    1 : tunables   54   27    8 : slabdata   17081   17081    243

So reading just one single file that is bigger than the memory of the system pulls the file data through the page cache, shrinks the slabs along the way, and the valuable vfs cache is lost. Instead, the memory is filled with the tail of the big file, which wouldn't even help if the file were read again.
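
To put a number on that loss, the two `xfs_inode` lines from the output above can be compared directly. A small sketch with made-up helper names (`slab_active_objs`, `evicted_percent`), relying on the second slabinfo column being the count of active objects:

```shell
#!/usr/bin/env bash
# Sketch: how much of the xfs_inode slab was dropped by reading the big file?

# Print the active object count (2nd column) of the named slab cache
# from slabinfo-formatted text on stdin.
slab_active_objs() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Percentage of objects evicted between a "before" and an "after" count.
evicted_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { if (before > 0) printf "%.1f\n", 100 * (before - after) / before }'
}

# The two xfs_inode lines copied from the output above.
after_walk='xfs_inode 5006003 5006008 960 4 1 : tunables 54 27 8 : slabdata 1251502 1251502 0'
after_read='xfs_inode 67714 68324 960 4 1 : tunables 54 27 8 : slabdata 17081 17081 243'

before=$(echo "$after_walk" | slab_active_objs xfs_inode)
after=$(echo "$after_read" | slab_active_objs xfs_inode)
echo "evicted: $(evicted_percent "$before" "$after")% of $before cached inodes"
```

With the numbers above this reports that 98.6% of the roughly five million cached inodes were dropped by a single `cat`.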
> In the discussions [2], adjusting `vfs_cache_pressure` is recommended, but, besides setting it to 0, it only seems to delay the shrinking of the cache. (As it’s an integer, 1 is the lowest non-zero (positive) number, which would delay it by a factor of 100.)
>
> Is there a way to specify the minimum number of entries in the inode cache, or a minimum SLAB size below which the caches should not be shrunk?

Or limit the page cache.

There was an attempt to make that possible [1], but it looks like it didn't get anywhere.

[1]: https://lwn.net/Articles/602424/

Best
   Donald

> Kind regards,
>
> Paul
>
>
> [1]: https://xfs.org/index.php/Runtime_Stats#ig
> [2]: https://linux-xfs.oss.sgi.narkive.com/qa0AYeBS/improving-xfs-file-system-inode-performance
>      "Improving XFS file system inode performance" from 2010