From: Waiman Long <longman@redhat.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
Alex Shi <alex.shi@linux.alibaba.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/filemap: Adding missing mem_cgroup_uncharge() to __add_to_page_cache_locked()
Date: Mon, 25 Jan 2021 13:57:18 -0500
Message-ID: <bbc6c5d0-bcc9-f538-af4c-166b0d2d1c04@redhat.com>
In-Reply-To: <YA8TcICO1OpFwKsj@cmpxchg.org>
On 1/25/21 1:52 PM, Johannes Weiner wrote:
> On Mon, Jan 25, 2021 at 01:23:58PM -0500, Waiman Long wrote:
>> On 1/25/21 1:14 PM, Michal Hocko wrote:
>>> On Mon 25-01-21 17:41:19, Michal Hocko wrote:
>>>> On Mon 25-01-21 16:25:06, Matthew Wilcox wrote:
>>>>> On Mon, Jan 25, 2021 at 05:03:28PM +0100, Michal Hocko wrote:
>>>>>> On Mon 25-01-21 10:57:54, Waiman Long wrote:
>>>>>>> On 1/25/21 4:28 AM, Michal Hocko wrote:
>>>>>>>> On Sun 24-01-21 23:24:41, Waiman Long wrote:
>>>>>>>>> The commit 3fea5a499d57 ("mm: memcontrol: convert page
>>>>>>>>> cache to a new mem_cgroup_charge() API") introduced a bug in
>>>>>>>>> __add_to_page_cache_locked() causing the following splat:
>>>>>>>>>
>>>>>>>>> [ 1570.068330] page dumped because: VM_BUG_ON_PAGE(page_memcg(page))
>>>>>>>>> [ 1570.068333] pages's memcg:ffff8889a4116000
>>>>>>>>> [ 1570.068343] ------------[ cut here ]------------
>>>>>>>>> [ 1570.068346] kernel BUG at mm/memcontrol.c:2924!
>>>>>>>>> [ 1570.068355] invalid opcode: 0000 [#1] SMP KASAN PTI
>>>>>>>>> [ 1570.068359] CPU: 35 PID: 12345 Comm: cat Tainted: G S W I 5.11.0-rc4-debug+ #1
>>>>>>>>> [ 1570.068363] Hardware name: HP HP Z8 G4 Workstation/81C7, BIOS P60 v01.25 12/06/2017
>>>>>>>>> [ 1570.068365] RIP: 0010:commit_charge+0xf4/0x130
>>>>>>>>> :
>>>>>>>>> [ 1570.068375] RSP: 0018:ffff8881b38d70e8 EFLAGS: 00010286
>>>>>>>>> [ 1570.068379] RAX: 0000000000000000 RBX: ffffea00260ddd00 RCX: 0000000000000027
>>>>>>>>> [ 1570.068382] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffff88907ebe05a8
>>>>>>>>> [ 1570.068384] RBP: ffffea00260ddd00 R08: ffffed120fd7c0b6 R09: ffffed120fd7c0b6
>>>>>>>>> [ 1570.068386] R10: ffff88907ebe05ab R11: ffffed120fd7c0b5 R12: ffffea00260ddd38
>>>>>>>>> [ 1570.068389] R13: ffff8889a4116000 R14: ffff8889a4116000 R15: 0000000000000001
>>>>>>>>> [ 1570.068391] FS: 00007ff039638680(0000) GS:ffff88907ea00000(0000) knlGS:0000000000000000
>>>>>>>>> [ 1570.068394] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>>> [ 1570.068396] CR2: 00007f36f354cc20 CR3: 00000008a0126006 CR4: 00000000007706e0
>>>>>>>>> [ 1570.068398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>>> [ 1570.068400] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>>> [ 1570.068402] PKRU: 55555554
>>>>>>>>> [ 1570.068404] Call Trace:
>>>>>>>>> [ 1570.068407] mem_cgroup_charge+0x175/0x770
>>>>>>>>> [ 1570.068413] __add_to_page_cache_locked+0x712/0xad0
>>>>>>>>> [ 1570.068439] add_to_page_cache_lru+0xc5/0x1f0
>>>>>>>>> [ 1570.068461] cachefiles_read_or_alloc_pages+0x895/0x2e10 [cachefiles]
>>>>>>>>> [ 1570.068524] __fscache_read_or_alloc_pages+0x6c0/0xa00 [fscache]
>>>>>>>>> [ 1570.068540] __nfs_readpages_from_fscache+0x16d/0x630 [nfs]
>>>>>>>>> [ 1570.068585] nfs_readpages+0x24e/0x540 [nfs]
>>>>>>>>> [ 1570.068693] read_pages+0x5b1/0xc40
>>>>>>>>> [ 1570.068711] page_cache_ra_unbounded+0x460/0x750
>>>>>>>>> [ 1570.068729] generic_file_buffered_read_get_pages+0x290/0x1710
>>>>>>>>> [ 1570.068756] generic_file_buffered_read+0x2a9/0xc30
>>>>>>>>> [ 1570.068832] nfs_file_read+0x13f/0x230 [nfs]
>>>>>>>>> [ 1570.068872] new_sync_read+0x3af/0x610
>>>>>>>>> [ 1570.068901] vfs_read+0x339/0x4b0
>>>>>>>>> [ 1570.068909] ksys_read+0xf1/0x1c0
>>>>>>>>> [ 1570.068920] do_syscall_64+0x33/0x40
>>>>>>>>> [ 1570.068926] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>>>>>>> [ 1570.068930] RIP: 0033:0x7ff039135595
>>>>>>>>>
>>>>>>>>> Before that commit, there was a try_charge() and commit_charge()
>>>>>>>>> in __add_to_page_cache_locked(). These two separate charge functions
>>>>>>>>> were replaced by a single mem_cgroup_charge(). However, the commit
>>>>>>>>> did not add a matching mem_cgroup_uncharge() for the case where the
>>>>>>>>> xarray insertion fails and the page is released back to the pool. Fix
>>>>>>>>> this by adding a mem_cgroup_uncharge() call on insertion error.
>>>>>>>>>
>>>>>>>>> Fixes: 3fea5a499d57 ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API")
>>>>>>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>>>>> OK, this is indeed a subtle bug. The patch aimed at simplifying the
>>>>>>>> charge lifetime so that users do not really have to think about when to
>>>>>>>> uncharge as that happens when the page is freed. fscache somehow breaks
>>>>>>>> that assumption because it doesn't free up pages but it keeps some of
>>>>>>>> them in the cache.
>>>>>>>>
>>>>>>>> I have tried to wrap my head around the cached object lifetime in
>>>>>>>> fscache but failed and got lost in the maze. Is this the only instance
>>>>>>>> of the problem? Would it make more sense to explicitly handle charges in
>>>>>>>> the fscache code, or are there other potential users that could fall
>>>>>>>> into this trap?
>>>>>>> There may be other places that have a similar problem. I focused on the
>>>>>>> filemap.c case as I have a test case that can reliably produce the bug
>>>>>>> splat. This patch does fix it for my test case.
>>>>>> I believe this needs a more general fix than catching random places
>>>>>> that you can trigger. Would it make more sense to address this at the
>>>>>> fscache level and make sure that a page returned to the pool is
>>>>>> always uncharged instead?
>>>>> I believe you mean "page cache" -- there is a separate thing called
>>>>> 'fscache' which is used to cache network filesystems.
>>>> Yes, I really had fscache in mind because it does have "unusual" page
>>>> lifetime rules.
>>>>
>>>>> I don't understand the memcg code at all, so I have no useful feedback
>>>>> on what you're saying other than this.
>>>> Well the memcg accounting rules after the rework should have simplified
>>>> the API usage for most users. You will get memory charged when it is
>>>> used and it will go away when the page is freed. If a page is not really
>>>> freed in some cases and it can be reused then it doesn't really fit into
>>>> this scheme automagically. I do understand that this puts some additional
>>>> burden on those special cases. I am not really sure what the right
>>>> way is here myself, but considering there might be other similar cases like
>>>> that I would lean towards special casing where the pool is implemented.
>>>> I would expect there is some state to be maintained for that purpose
>>>> already.
>>> After some more thinking I have come to the conclusion that the patch as
>>> proposed is the proper way forward. It is easier to follow if the
>>> unwinding of state changes is local to the function.
>> I think so. It is easier to understand if the charge and uncharge functions
>> are grouped together in the same function.
>>> With the proposed simplification by Willy
>>> Acked-by: Michal Hocko <mhocko@suse.com>
>> Thanks for the ack. However, I am a bit confused about what you mean by
>> simplification. There is another linux-next patch that changes the condition
>> for mem_cgroup_charge() to
>>
>> - if (!huge) {
>> + if (!huge && !page_is_secretmem(page)) {
>> error = mem_cgroup_charge(page, current->mm, gfp);
>>
>> That is the main reason why I introduced the boolean variable as I don't
>> want to call the external page_is_secretmem() function twice.
> The variable works for me.
>
> On the other hand, as Michal points out, the uncharge function will be
> called again on the page when it's being freed (in non-fscache cases),
> so you're already relying on being able to call it on any page -
> charged, uncharged, never charged. It would be fine to call it
> unconditionally in the error path. Aesthetic preference, I guess.
That may be true. However, I haven't fully studied how the huge page
memory accounting works to make sure the uncharge function can be called
for huge pages. So I will keep the current code for now.
Thanks,
Longman
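
For readers following the thread, here is a minimal userspace sketch of the
unwinding pattern discussed above: charge, attempt the insertion, and undo
the charge locally if the insertion fails. It is not the kernel code. The
types and the model_* functions are hypothetical stand-ins, the
insert_fails parameter simulates an xarray insertion error, and the model
returns -EBUSY where the kernel would hit VM_BUG_ON_PAGE().

```c
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

/* Hypothetical stand-ins for the kernel structures, not the real ones. */
struct memcg { long charged_pages; };
struct page  { struct memcg *memcg; bool huge; };

/* Model of mem_cgroup_charge(): bind the page to a memcg and account it. */
static int model_charge(struct page *page, struct memcg *memcg)
{
	if (page->memcg)	/* the kernel would BUG here; the model just fails */
		return -EBUSY;
	page->memcg = memcg;
	memcg->charged_pages++;
	return 0;
}

/* Model of mem_cgroup_uncharge(): a no-op if the page holds no charge. */
static void model_uncharge(struct page *page)
{
	if (!page->memcg)
		return;
	page->memcg->charged_pages--;
	page->memcg = NULL;
}

/*
 * Model of the insertion path. The "charged" flag records whether this
 * function took the charge, so the error path only undoes its own work.
 */
static int model_add_to_page_cache(struct page *page, struct memcg *memcg,
				   bool insert_fails)
{
	bool charged = false;
	int error;

	if (!page->huge) {
		error = model_charge(page, memcg);
		if (error)
			return error;
		charged = true;
	}

	if (insert_fails) {	/* simulated xarray insertion failure */
		error = -EEXIST;
		goto unwind;
	}
	return 0;

unwind:
	if (charged)
		model_uncharge(page);
	return error;
}
```

In this model the uncharge is harmless on a page that was never charged,
which corresponds to the unconditional variant Johannes describes. The flag
corresponds to the variant kept in the patch, where the error path does not
depend on how uncharging behaves for huge pages.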