From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=yymi=G4=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-8.6 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id CB8EEC433E0
	for <linux-mm@archiver.kernel.org>; Mon, 25 Jan 2021 18:31:05 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 02D9922B3F
	for <linux-mm@archiver.kernel.org>; Mon, 25 Jan 2021 18:31:03 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 02D9922B3F
Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id 4EB3D6B0078; Mon, 25 Jan 2021 13:31:03 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 49A9E6B007B; Mon, 25 Jan 2021 13:31:03 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 362BB6B007D; Mon, 25 Jan 2021 13:31:03 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0228.hostedemail.com [216.40.44.228])
	by kanga.kvack.org (Postfix) with ESMTP id 216D06B0078
	for <linux-mm@kvack.org>; Mon, 25 Jan 2021 13:31:03 -0500 (EST)
Received: from smtpin26.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id CC50E3628
	for <linux-mm@kvack.org>; Mon, 25 Jan 2021 18:31:02 +0000 (UTC)
X-FDA: 77745139164.26.play77_580814927587
Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251])
	by smtpin26.hostedemail.com (Postfix) with ESMTP id ADC1F1804B669
	for <linux-mm@kvack.org>; Mon, 25 Jan 2021 18:31:02 +0000 (UTC)
X-HE-Tag: play77_580814927587
X-Filterd-Recvd-Size: 9980
Received: from casper.infradead.org (casper.infradead.org [90.155.50.34])
	by imf07.hostedemail.com (Postfix) with ESMTP
	for <linux-mm@kvack.org>; Mon, 25 Jan 2021 18:31:00 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed;
	d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Transfer-Encoding:
	Content-Type:MIME-Version:References:Message-ID:Subject:Cc:To:From:Date:
	Sender:Reply-To:Content-ID:Content-Description;
	bh=T1UWtlNZr7Dokdzr3mVGS/edHEQG76PknlPWffcpm4k=; b=ORyY0SwLbNumNsaGfMQOC1fNeF
	opNajgOJtJ0+wMvbeVy/V3NjEu9tUUUEEk1qW/KcU/0/Pb9i8NJFiqpqn599ch+76fi2ua4uji3AH
	Ze6uAWxivrLFUbg3Casz4FKJe0ZUFdeAUOKRPKqrcxWe3rfekFa93w/1DXzKWdoKLFbidQwE49mPq
	gX31XgRfdapRiEMeuujfVOxBCWMEcbC+xMi5cak4b03s8nUS7F1aqj4lhkEmbCWbdCYXa+DyBtgvV
	lxnCRviQVch2r5o62HYxuHG7ZUyYnJIMuFqFz0lkrUVV9gbYevICMNMKYcupjD+Fif5daAjRfZ7sp
	I8SBhupg==;
Received: from willy by casper.infradead.org with local (Exim 4.94 #2 (Red Hat Linux))
	id 1l46c5-004WEQ-PX; Mon, 25 Jan 2021 18:29:31 +0000
Date: Mon, 25 Jan 2021 18:29:21 +0000
From: Matthew Wilcox <willy@infradead.org>
To: Waiman Long <longman@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Alex Shi <alex.shi@linux.alibaba.com>, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/filemap: Adding missing mem_cgroup_uncharge() to
 __add_to_page_cache_locked()
Message-ID: <20210125182921.GG308988@casper.infradead.org>
References: <20210125042441.20030-1-longman@redhat.com>
 <20210125092815.GB827@dhcp22.suse.cz>
 <de87d009-985a-87d3-08fb-c688e23d60a9@redhat.com>
 <20210125160328.GP827@dhcp22.suse.cz>
 <20210125162506.GF308988@casper.infradead.org>
 <20210125164118.GS827@dhcp22.suse.cz>
 <20210125181436.GV827@dhcp22.suse.cz>
 <53eb7692-e559-a914-e103-adfe951d7a7c@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
In-Reply-To: <53eb7692-e559-a914-e103-adfe951d7a7c@redhat.com>
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, Jan 25, 2021 at 01:23:58PM -0500, Waiman Long wrote:
> On 1/25/21 1:14 PM, Michal Hocko wrote:
> > On Mon 25-01-21 17:41:19, Michal Hocko wrote:
> > > On Mon 25-01-21 16:25:06, Matthew Wilcox wrote:
> > > > On Mon, Jan 25, 2021 at 05:03:28PM +0100, Michal Hocko wrote:
> > > > > On Mon 25-01-21 10:57:54, Waiman Long wrote:
> > > > > > On 1/25/21 4:28 AM, Michal Hocko wrote:
> > > > > > > On Sun 24-01-21 23:24:41, Waiman Long wrote:
> > > > > > > > The commit 3fea5a499d57 ("mm: memcontrol: convert page
> > > > > > > > cache to a new mem_cgroup_charge() API") introduced a bug=
 in
> > > > > > > > __add_to_page_cache_locked() causing the following splat:
> > > > > > > >=20
> > > > > > > >    [ 1570.068330] page dumped because: VM_BUG_ON_PAGE(pag=
e_memcg(page))
> > > > > > > >    [ 1570.068333] pages's memcg:ffff8889a4116000
> > > > > > > >    [ 1570.068343] ------------[ cut here ]------------
> > > > > > > >    [ 1570.068346] kernel BUG at mm/memcontrol.c:2924!
> > > > > > > >    [ 1570.068355] invalid opcode: 0000 [#1] SMP KASAN PTI
> > > > > > > >    [ 1570.068359] CPU: 35 PID: 12345 Comm: cat Tainted: G=
 S      W I       5.11.0-rc4-debug+ #1
> > > > > > > >    [ 1570.068363] Hardware name: HP HP Z8 G4 Workstation/=
81C7, BIOS P60 v01.25 12/06/2017
> > > > > > > >    [ 1570.068365] RIP: 0010:commit_charge+0xf4/0x130
> > > > > > > >      :
> > > > > > > >    [ 1570.068375] RSP: 0018:ffff8881b38d70e8 EFLAGS: 0001=
0286
> > > > > > > >    [ 1570.068379] RAX: 0000000000000000 RBX: ffffea00260d=
dd00 RCX: 0000000000000027
> > > > > > > >    [ 1570.068382] RDX: 0000000000000000 RSI: 000000000000=
0004 RDI: ffff88907ebe05a8
> > > > > > > >    [ 1570.068384] RBP: ffffea00260ddd00 R08: ffffed120fd7=
c0b6 R09: ffffed120fd7c0b6
> > > > > > > >    [ 1570.068386] R10: ffff88907ebe05ab R11: ffffed120fd7=
c0b5 R12: ffffea00260ddd38
> > > > > > > >    [ 1570.068389] R13: ffff8889a4116000 R14: ffff8889a411=
6000 R15: 0000000000000001
> > > > > > > >    [ 1570.068391] FS:  00007ff039638680(0000) GS:ffff8890=
7ea00000(0000) knlGS:0000000000000000
> > > > > > > >    [ 1570.068394] CS:  0010 DS: 0000 ES: 0000 CR0: 000000=
0080050033
> > > > > > > >    [ 1570.068396] CR2: 00007f36f354cc20 CR3: 00000008a012=
6006 CR4: 00000000007706e0
> > > > > > > >    [ 1570.068398] DR0: 0000000000000000 DR1: 000000000000=
0000 DR2: 0000000000000000
> > > > > > > >    [ 1570.068400] DR3: 0000000000000000 DR6: 00000000fffe=
0ff0 DR7: 0000000000000400
> > > > > > > >    [ 1570.068402] PKRU: 55555554
> > > > > > > >    [ 1570.068404] Call Trace:
> > > > > > > >    [ 1570.068407]  mem_cgroup_charge+0x175/0x770
> > > > > > > >    [ 1570.068413]  __add_to_page_cache_locked+0x712/0xad0
> > > > > > > >    [ 1570.068439]  add_to_page_cache_lru+0xc5/0x1f0
> > > > > > > >    [ 1570.068461]  cachefiles_read_or_alloc_pages+0x895/0=
x2e10 [cachefiles]
> > > > > > > >    [ 1570.068524]  __fscache_read_or_alloc_pages+0x6c0/0x=
a00 [fscache]
> > > > > > > >    [ 1570.068540]  __nfs_readpages_from_fscache+0x16d/0x6=
30 [nfs]
> > > > > > > >    [ 1570.068585]  nfs_readpages+0x24e/0x540 [nfs]
> > > > > > > >    [ 1570.068693]  read_pages+0x5b1/0xc40
> > > > > > > >    [ 1570.068711]  page_cache_ra_unbounded+0x460/0x750
> > > > > > > >    [ 1570.068729]  generic_file_buffered_read_get_pages+0=
x290/0x1710
> > > > > > > >    [ 1570.068756]  generic_file_buffered_read+0x2a9/0xc30
> > > > > > > >    [ 1570.068832]  nfs_file_read+0x13f/0x230 [nfs]
> > > > > > > >    [ 1570.068872]  new_sync_read+0x3af/0x610
> > > > > > > >    [ 1570.068901]  vfs_read+0x339/0x4b0
> > > > > > > >    [ 1570.068909]  ksys_read+0xf1/0x1c0
> > > > > > > >    [ 1570.068920]  do_syscall_64+0x33/0x40
> > > > > > > >    [ 1570.068926]  entry_SYSCALL_64_after_hwframe+0x44/0x=
a9
> > > > > > > >    [ 1570.068930] RIP: 0033:0x7ff039135595
> > > > > > > >=20
> > > > > > > > Before that commit, there was a try_charge() and commit_c=
harge()
> > > > > > > > in __add_to_page_cache_locked(). These 2 separated charge=
 functions
> > > > > > > > were replaced by a single mem_cgroup_charge(). However, i=
t forgot
> > > > > > > > to add a matching mem_cgroup_uncharge() when the xarray i=
nsertion
> > > > > > > > failed with the page released back to the pool. Fix this =
by adding a
> > > > > > > > mem_cgroup_uncharge() call when insertion error happens.
> > > > > > > >=20
> > > > > > > > Fixes: 3fea5a499d57 ("mm: memcontrol: convert page cache =
to a new mem_cgroup_charge() API")
> > > > > > > > Signed-off-by: Waiman Long <longman@redhat.com>
> > > > > > > OK, this is indeed a subtle bug. The patch aimed at simplif=
ying the
> > > > > > > charge lifetime so that users do not really have to think a=
bout when to
> > > > > > > uncharge as that happens when the page is freed. fscache so=
mehow breaks
> > > > > > > that assumption because it doesn't free up pages but it kee=
ps some of
> > > > > > > them in the cache.
> > > > > > >=20
> > > > > > > I have tried to wrap my head around the cached object life =
time in
> > > > > > > fscache but failed and got lost in the maze. Is this the on=
ly instance
> > > > > > > of the problem? Would it make more sense to explicitly hand=
le charges in
> > > > > > > the fscache code or there are other potential users to fall=
 into this
> > > > > > > trap?
> > > > > > There may be other places that have similar problem. I focus =
on the
> > > > > > filemap.c case as I have a test case that can reliably produc=
e the bug
> > > > > > splat. This patch does fix it for my test case.
> > > > > I believe this needs a more general fix than catching a random =
places
> > > > > which you can trigger. Would it make more sense to address this=
 at the
> > > > > fscache level and always make sure that a page returned to the =
pool is
> > > > > always uncharged instead?
> > > > I believe you mean "page cache" -- there is a separate thing call=
ed
> > > > 'fscache' which is used to cache network filesystems.
> > > Yes, I really had fscache in mind because it does have an "unusual"=
 page
> > > life time rules.
> > >=20
> > > > I don't understand the memcg code at all, so I have no useful fee=
dback
> > > > on what you're saying other than this.
> > > Well the memcg accounting rules after the rework should have simpli=
fied
> > > the API usage for most users. You will get memory charged when it i=
s
> > > used and it will go away when the page is freed. If a page is not r=
eally
> > > freed in some cases and it can be reused then it doesn't really fit=
 into
> > > this scheme automagically. I do undestand that this puts some addit=
ional
> > > burden on those special cases. I am not really sure what is the rig=
ht
> > > way here myself but considering there might be other similar cases =
like
> > > that I would lean towards special casing where the pool is implemen=
ted.
> > > I would expect there is some state to be maintain for that purpose
> > > already.
> > After some more thinking I've came to conclusion that the patch as
> > proposed is the proper way forward. It is easier to follow if the
> > unwinding of state changes are local to the function.
> I think so. It is easier to understand if the charge and uncharge funct=
ions
> are grouped together in the same function.
> >=20
> > With the proposed simplification by Willy
> > Acked-by: Michal Hocko <mhocko@suse.com>
>=20
> Thank for the ack. However, I am a bit confused about what you mean by
> simplification. There is another linux-next patch that changes the cond=
ition
> for mem_cgroup_charge() to
>=20
> -=A0=A0=A0=A0=A0=A0 if (!huge) {
> +=A0=A0=A0=A0=A0=A0 if (!huge && !page_is_secretmem(page)) {
> =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 error =3D mem_cgroup_char=
ge(page, current->mm, gfp);
>=20
> That is the main reason why I introduced the boolean variable as I don'=
t
> want to call the external page_is_secretmem() function twice.

See earlier emails to Mike suggesting that the accounting of secretmem
here is wrong.