From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH] mm/filemap: Adding missing mem_cgroup_uncharge() to
 __add_to_page_cache_locked()
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>, Matthew Wilcox <willy@infradead.org>,
 Andrew Morton <akpm@linux-foundation.org>,
 Alex Shi <alex.shi@linux.alibaba.com>, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20210125042441.20030-1-longman@redhat.com>
 <20210125092815.GB827@dhcp22.suse.cz>
 <de87d009-985a-87d3-08fb-c688e23d60a9@redhat.com>
 <20210125160328.GP827@dhcp22.suse.cz>
 <20210125162506.GF308988@casper.infradead.org>
 <20210125164118.GS827@dhcp22.suse.cz> <20210125181436.GV827@dhcp22.suse.cz>
 <53eb7692-e559-a914-e103-adfe951d7a7c@redhat.com>
 <YA8TcICO1OpFwKsj@cmpxchg.org>
From: Waiman Long <longman@redhat.com>
Organization: Red Hat
Message-ID: <bbc6c5d0-bcc9-f538-af4c-166b0d2d1c04@redhat.com>
Date: Mon, 25 Jan 2021 13:57:18 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.6.0
MIME-Version: 1.0
In-Reply-To: <YA8TcICO1OpFwKsj@cmpxchg.org>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 1/25/21 1:52 PM, Johannes Weiner wrote:
> On Mon, Jan 25, 2021 at 01:23:58PM -0500, Waiman Long wrote:
>> On 1/25/21 1:14 PM, Michal Hocko wrote:
>>> On Mon 25-01-21 17:41:19, Michal Hocko wrote:
>>>> On Mon 25-01-21 16:25:06, Matthew Wilcox wrote:
>>>>> On Mon, Jan 25, 2021 at 05:03:28PM +0100, Michal Hocko wrote:
>>>>>> On Mon 25-01-21 10:57:54, Waiman Long wrote:
>>>>>>> On 1/25/21 4:28 AM, Michal Hocko wrote:
>>>>>>>> On Sun 24-01-21 23:24:41, Waiman Long wrote:
>>>>>>>>> The commit 3fea5a499d57 ("mm: memcontrol: convert page
>>>>>>>>> cache to a new mem_cgroup_charge() API") introduced a bug in
>>>>>>>>> __add_to_page_cache_locked() causing the following splat:
>>>>>>>>>
>>>>>>>>>     [ 1570.068330] page dumped because: VM_BUG_ON_PAGE(page_memcg(page))
>>>>>>>>>     [ 1570.068333] pages's memcg:ffff8889a4116000
>>>>>>>>>     [ 1570.068343] ------------[ cut here ]------------
>>>>>>>>>     [ 1570.068346] kernel BUG at mm/memcontrol.c:2924!
>>>>>>>>>     [ 1570.068355] invalid opcode: 0000 [#1] SMP KASAN PTI
>>>>>>>>>     [ 1570.068359] CPU: 35 PID: 12345 Comm: cat Tainted: G S      W I       5.11.0-rc4-debug+ #1
>>>>>>>>>     [ 1570.068363] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.25 12/06/2017
>>>>>>>>>     [ 1570.068365] RIP: 0010:commit_charge+0xf4/0x130
>>>>>>>>>       :
>>>>>>>>>     [ 1570.068375] RSP: 0018:ffff8881b38d70e8 EFLAGS: 00010286
>>>>>>>>>     [ 1570.068379] RAX: 0000000000000000 RBX: ffffea00260ddd00 RCX: 0000000000000027
>>>>>>>>>     [ 1570.068382] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88907ebe05a8
>>>>>>>>>     [ 1570.068384] RBP: ffffea00260ddd00 R08: ffffed120fd7c0b6 R09: ffffed120fd7c0b6
>>>>>>>>>     [ 1570.068386] R10: ffff88907ebe05ab R11: ffffed120fd7c0b5 R12: ffffea00260ddd38
>>>>>>>>>     [ 1570.068389] R13: ffff8889a4116000 R14: ffff8889a4116000 R15: 0000000000000001
>>>>>>>>>     [ 1570.068391] FS:  00007ff039638680(0000) GS:ffff88907ea00000(0000) knlGS:0000000000000000
>>>>>>>>>     [ 1570.068394] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>>     [ 1570.068396] CR2: 00007f36f354cc20 CR3: 00000008a0126006 CR4: 00000000007706e0
>>>>>>>>>     [ 1570.068398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>>     [ 1570.068400] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>>     [ 1570.068402] PKRU: 55555554
>>>>>>>>>     [ 1570.068404] Call Trace:
>>>>>>>>>     [ 1570.068407]  mem_cgroup_charge+0x175/0x770
>>>>>>>>>     [ 1570.068413]  __add_to_page_cache_locked+0x712/0xad0
>>>>>>>>>     [ 1570.068439]  add_to_page_cache_lru+0xc5/0x1f0
>>>>>>>>>     [ 1570.068461]  cachefiles_read_or_alloc_pages+0x895/0x2e10 [cachefiles]
>>>>>>>>>     [ 1570.068524]  __fscache_read_or_alloc_pages+0x6c0/0xa00 [fscache]
>>>>>>>>>     [ 1570.068540]  __nfs_readpages_from_fscache+0x16d/0x630 [nfs]
>>>>>>>>>     [ 1570.068585]  nfs_readpages+0x24e/0x540 [nfs]
>>>>>>>>>     [ 1570.068693]  read_pages+0x5b1/0xc40
>>>>>>>>>     [ 1570.068711]  page_cache_ra_unbounded+0x460/0x750
>>>>>>>>>     [ 1570.068729]  generic_file_buffered_read_get_pages+0x290/0x1710
>>>>>>>>>     [ 1570.068756]  generic_file_buffered_read+0x2a9/0xc30
>>>>>>>>>     [ 1570.068832]  nfs_file_read+0x13f/0x230 [nfs]
>>>>>>>>>     [ 1570.068872]  new_sync_read+0x3af/0x610
>>>>>>>>>     [ 1570.068901]  vfs_read+0x339/0x4b0
>>>>>>>>>     [ 1570.068909]  ksys_read+0xf1/0x1c0
>>>>>>>>>     [ 1570.068920]  do_syscall_64+0x33/0x40
>>>>>>>>>     [ 1570.068926]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>>>>>     [ 1570.068930] RIP: 0033:0x7ff039135595
>>>>>>>>>
>>>>>>>>> Before that commit, there was a try_charge() and commit_charge()
>>>>>>>>> in __add_to_page_cache_locked(). These 2 separated charge functions
>>>>>>>>> were replaced by a single mem_cgroup_charge(). However, it forgot
>>>>>>>>> to add a matching mem_cgroup_uncharge() when the xarray insertion
>>>>>>>>> failed with the page released back to the pool. Fix this by adding a
>>>>>>>>> mem_cgroup_uncharge() call when insertion error happens.
>>>>>>>>>
>>>>>>>>> Fixes: 3fea5a499d57 ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API")
>>>>>>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>>>>> OK, this is indeed a subtle bug. The patch aimed at simplifying the
>>>>>>>> charge lifetime so that users do not really have to think about when to
>>>>>>>> uncharge as that happens when the page is freed. fscache somehow breaks
>>>>>>>> that assumption because it doesn't free up pages but it keeps some of
>>>>>>>> them in the cache.
>>>>>>>>
>>>>>>>> I have tried to wrap my head around the cached object life time in
>>>>>>>> fscache but failed and got lost in the maze. Is this the only instance
>>>>>>>> of the problem? Would it make more sense to explicitly handle charges in
>>>>>>>> the fscache code or there are other potential users to fall into this
>>>>>>>> trap?
>>>>>>> There may be other places that have similar problem. I focus on the
>>>>>>> filemap.c case as I have a test case that can reliably produce the bug
>>>>>>> splat. This patch does fix it for my test case.
>>>>>> I believe this needs a more general fix than catching a random places
>>>>>> which you can trigger. Would it make more sense to address this at the
>>>>>> fscache level and always make sure that a page returned to the pool is
>>>>>> always uncharged instead?
>>>>> I believe you mean "page cache" -- there is a separate thing called
>>>>> 'fscache' which is used to cache network filesystems.
>>>> Yes, I really had fscache in mind because it does have an "unusual" page
>>>> life time rules.
>>>>
>>>>> I don't understand the memcg code at all, so I have no useful feedback
>>>>> on what you're saying other than this.
>>>> Well the memcg accounting rules after the rework should have simplified
>>>> the API usage for most users. You will get memory charged when it is
>>>> used and it will go away when the page is freed. If a page is not really
>>>> freed in some cases and it can be reused then it doesn't really fit into
>>>> this scheme automagically. I do understand that this puts some additional
>>>> burden on those special cases. I am not really sure what is the right
>>>> way here myself but considering there might be other similar cases like
>>>> that I would lean towards special casing where the pool is implemented.
>>>> I would expect there is some state to be maintained for that purpose
>>>> already.
>>> After some more thinking I've come to the conclusion that the patch as
>>> proposed is the proper way forward. It is easier to follow if the
>>> unwinding of state changes is local to the function.
>> I think so. It is easier to understand if the charge and uncharge functions
>> are grouped together in the same function.
>>> With the proposed simplification by Willy
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Thanks for the ack. However, I am a bit confused about what you mean by
>> simplification. There is another linux-next patch that changes the condition
>> for mem_cgroup_charge() to
>>
>> -       if (!huge) {
>> +       if (!huge && !page_is_secretmem(page)) {
>>                  error = mem_cgroup_charge(page, current->mm, gfp);
>>
>> That is the main reason why I introduced the boolean variable as I don't
>> want to call the external page_is_secretmem() function twice.
> The variable works for me.
>
> On the other hand, as Michal points out, the uncharge function will be
> called again on the page when it's being freed (in non-fscache cases),
> so you're already relying on being able to call it on any page -
> charged, uncharged, never charged. It would be fine to call it
> unconditionally in the error path. Aesthetic preference, I guess.

That may be true. However, I haven't fully studied how the huge page
memory accounting works to make sure the uncharge function can be called
for huge pages. So I will keep the current code for now.
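
For readers following the thread, here is a minimal user-space sketch of
the error-path pattern being discussed. It is not the actual patch and
not kernel code: struct toy_page, toy_charge(), toy_uncharge(),
toy_add_to_cache() and the flag name "charged" are all hypothetical
stand-ins, and the insert_fails parameter merely simulates a failed
xarray insertion. The assert in toy_charge() plays the role of the
VM_BUG_ON_PAGE(page_memcg(page)) check from the splat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel type. */
struct toy_page {
	void *memcg;		/* non-NULL while the page is charged */
};

static int toy_memcg;		/* dummy cgroup object to point at */

/* Hypothetical analogue of mem_cgroup_charge(). */
static int toy_charge(struct toy_page *page)
{
	/* Charging an already charged page is the bug seen in the splat. */
	assert(page->memcg == NULL);
	page->memcg = &toy_memcg;
	return 0;
}

/* Hypothetical analogue of mem_cgroup_uncharge(); harmless if uncharged. */
static void toy_uncharge(struct toy_page *page)
{
	page->memcg = NULL;
}

/*
 * Sketch of the add-to-cache flow: charge first, then try the insertion.
 * If the insertion fails, undo the charge in the same function so that a
 * page handed back to a pool does not keep a stale charge.
 */
static int toy_add_to_cache(struct toy_page *page, bool huge, bool insert_fails)
{
	bool charged = false;	/* remembers whether this call charged the page */
	int error = 0;

	if (!huge) {
		error = toy_charge(page);
		if (error)
			return error;
		charged = true;
	}

	if (insert_fails) {	/* simulated insertion failure */
		error = -1;
		goto err;
	}
	return 0;

err:
	if (charged)
		toy_uncharge(page);
	return error;
}
```

In this toy model, a failed toy_add_to_cache() call followed by a second
attempt on the same page succeeds. Without the uncharge in the error
path, the second attempt would trip the assert in toy_charge(). Dropping
the "if (charged)" test and uncharging unconditionally, as Johannes
suggests, behaves the same here because toy_uncharge() is a no-op on an
uncharged page; whether that also holds for huge pages in the real code
is exactly what I have not verified.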

Thanks,
Longman