From: Yang Shi
Date: Wed, 16 Jan 2019 21:06:56 -0800
Subject: Re: memory cgroup pagecache and inode problem
To: Fam Zheng
Cc: cgroups@vger.kernel.org, Linux MM, tj@kernel.org, Johannes Weiner, lizefan@huawei.com, Michal Hocko, Vladimir Davydov, duanxiongchun@bytedance.com, 张永肃, liuxiaozhou@bytedance.com

On Wed, Jan 16, 2019 at 6:41 PM Fam Zheng wrote:
>
> > On Jan 17, 2019, at 05:06, Yang Shi wrote:
> >
> > On Tue, Jan 15, 2019 at 7:52 PM Fam Zheng wrote:
> >>
> >>> On Jan 16, 2019, at 08:50, Yang Shi wrote:
> >>>
> >>> On Thu, Jan 10, 2019 at 12:30 AM Fam Zheng wrote:
> >>>>
> >>>>> On Jan 10, 2019, at 13:36, Yang Shi wrote:
> >>>>>
> >>>>> On Sun, Jan 6, 2019 at 9:10 PM Fam Zheng wrote:
> >>>>>>
> >>>>>>> On Jan 5, 2019, at 03:36, Yang Shi wrote:
> >>>>>>>
> >>>>>>> drop_caches would drop all page caches globally. You may not want to
> >>>>>>> drop the page caches used by other memcgs.
> >>>>>>
> >>>>>> We've tried your async force_empty patch (with a modification to
> >>>>>> default it to true, to make it transparently enabled for the sake of
> >>>>>> testing), and for the past few days the stale mem cgroups still
> >>>>>> accumulate, up to 40k.
> >>>>>>
> >>>>>> We've double checked that the force_empty routines are invoked when
> >>>>>> a mem cgroup is offlined. But this doesn't look very effective so
> >>>>>> far, because once we do `echo 1 > /proc/sys/vm/drop_caches`, all the
> >>>>>> groups immediately go away.
> >>>>>>
> >>>>>> This is a bit unexpected.
> >>>>>>
> >>>>>> Yang, could you hint at what is missing in the force_empty
> >>>>>> operation, compared to a blanket drop_caches?
> >>>>>
> >>>>> drop_caches invalidates pages inode by inode, but memcg
> >>>>> force_empty calls memcg direct reclaim.
> >>>>
> >>>> But force_empty touches things that drop_caches doesn't? If so, then
> >>>> maybe combining both approaches is more reliable. Since, like you said,
> >>>
> >>> AFAICS, force_empty may unmap pages, but drop_caches doesn't.
> >>>
> >>>> dropping _all_ pages is usually too much and thus not desired, we may
> >>>> want to somehow limit the dropped caches to those that are in the
> >>>> memory cgroup in question. What do you think?
> >>>
> >>> This is what force_empty is supposed to do. But, as your test shows,
> >>> some page cache may still remain after force_empty, which then causes
> >>> offline memcgs to accumulate. I haven't figured out what happened. You
> >>> may try what Michal suggested.
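For reference, the two reclaim knobs compared above can be driven from
userspace roughly as follows. This is a minimal sketch assuming a cgroup v1
memory controller mounted at /sys/fs/cgroup/memory; the group name
"example" is hypothetical.

  # Global: drop clean page cache for every inode on the system,
  # regardless of which memcg the pages are charged to.
  echo 1 > /proc/sys/vm/drop_caches

  # Per-group (cgroup v1 only): ask the kernel to reclaim as much of the
  # memory charged to this one memcg as possible via memcg direct reclaim
  # (writing any value triggers it).
  echo 1 > /sys/fs/cgroup/memory/example/memory.force_empty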
> >>
> >> None of the existing patches helped so far, but we suspect that the
> >> pages cannot be locked at the force_empty moment. We have been working
> >> on a "retry" patch which does solve the problem. We'll do more tracing
> >> (to have a better understanding of the issue) and post the findings
> >> and/or the patch later. Thanks.
> >
> > You mean it solves the problem by retrying more times? Actually, I'm
> > not sure if you have swap set up in your test, but force_empty does do
> > swap if swap is on. This may mean it can't reclaim all the page cache
> > in 5 retries. I have a patch within that series to skip swap.
>
> Basically yes, retrying solves the problem. But compared to immediate
> retries, a scheduled retry in a few seconds is much more effective.

This may suggest that doing force_empty in a worker is in fact more
effective. Not sure if this is good enough to convince Johannes or not.

> We don't have swap on.
>
> What do you mean by 5 retries? I'm still a bit lost in the LRU code and
> patches.

MEM_CGROUP_RECLAIM_RETRIES is 5.

Yang

> > Yang
> >
> >> Fam
> >>
> >>> Yang
> >>>
> >>>>> Offlined memcgs will not go away if there are still pages charged.
> >>>>> Maybe it is related to the per-cpu memcg stock. I recall there are
> >>>>> some commits which do solve the per-cpu page counter cache problem:
> >>>>>
> >>>>> 591edfb10a94 mm: drain memcg stocks on css offlining
> >>>>> d12c60f64cf8 mm: memcontrol: drain memcg stock on force_empty
> >>>>> bb4a7ea2b144 mm: memcontrol: drain stocks on resize limit
> >>>>>
> >>>>> Not sure if they would help out.
> >>>>
> >>>> These are all in 4.20, which is tested but not helpful.
> >>>>
> >>>> Fam
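A quick way to check in a test like this whether offlined groups are
actually being freed is to watch the kernel's cgroup counts, e.g. (a
sketch assuming a standard /proc, and a cgroup v2 mount at /sys/fs/cgroup
for the second command):

  # The num_cgroups column for the memory controller should still count
  # groups that have been removed but are pinned by leftover charges.
  grep '^memory' /proc/cgroups

  # With cgroup v2, cgroup.stat separates live descendants from dying
  # (offlined but not yet freed) ones: nr_descendants vs.
  # nr_dying_descendants.
  cat /sys/fs/cgroup/cgroup.stat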