From: Alex Shi <alex.shi@linux.alibaba.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, akpm@linux-foundation.org,
mgorman@techsingularity.net, tj@kernel.org, hughd@google.com,
khlebnikov@yandex-team.ru, daniel.m.jordan@oracle.com,
yang.shi@linux.alibaba.com, willy@infradead.org,
shakeelb@google.com, "Michal Hocko" <mhocko@kernel.org>,
"Vladimir Davydov" <vdavydov.dev@gmail.com>,
"Roman Gushchin" <guro@fb.com>,
"Chris Down" <chris@chrisdown.name>,
"Thomas Gleixner" <tglx@linutronix.de>,
"Vlastimil Babka" <vbabka@suse.cz>, "Qian Cai" <cai@lca.pw>,
"Andrey Ryabinin" <aryabinin@virtuozzo.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
"Jérôme Glisse" <jglisse@redhat.com>,
"Andrea Arcangeli" <aarcange@redhat.com>,
"David Rientjes" <rientjes@google.com>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
swkhack <swkhack@gmail.com>,
"Potyra, Stefan" <Stefan.Potyra@elektrobit.com>,
"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
"Stephen Rothwell" <sfr@canb.auug.org.au>,
"Colin Ian King" <colin.king@canonical.com>,
"Jason Gunthorpe" <jgg@ziepe.ca>,
"Mauro Carvalho Chehab" <mchehab+samsung@kernel.org>,
"Peng Fan" <peng.fan@nxp.com>,
"Nikolay Borisov" <nborisov@suse.com>,
"Ira Weiny" <ira.weiny@intel.com>,
"Kirill Tkhai" <ktkhai@virtuozzo.com>,
"Yafang Shao" <laoar.shao@gmail.com>
Subject: Re: [PATCH v8 03/10] mm/lru: replace pgdat lru_lock with lruvec lock
Date: Mon, 13 Apr 2020 18:48:22 +0800 [thread overview]
Message-ID: <cdcdb710-1d78-6fac-48d7-35519ddcdc6a@linux.alibaba.com> (raw)
In-Reply-To: <20200116215222.GA64230@cmpxchg.org>
> In a previous review, I pointed out the following race condition
> between page charging and compaction:
>
> compaction:                             generic_file_buffered_read:
>
>                                         page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
>
>                                         add_to_page_cache_lru()
>                                           mem_cgroup_commit_charge()
>                                             page->mem_cgroup = foo
>                                           lru_cache_add()
>                                             __pagevec_lru_add()
>                                               SetPageLRU()
>
> if PageLRU(page):
>   __isolate_lru_page()
>
> As far as I can see, you have not addressed this. You have added
> lock_page_memcg(), but that prevents charged pages from moving between
> cgroups; it does not prevent newly allocated pages from being charged.
>
> It doesn't matter how many times you check the lruvec before and after
> locking - if you're looking at a free page, it might get allocated,
> charged and put on a new lruvec after you're done checking, and then
> you isolate a page from an unlocked lruvec.
>
> You simply cannot serialize on page->mem_cgroup->lruvec when
> page->mem_cgroup isn't stable. You need to serialize on the page
> itself, one way or another, to make this work.
>
>
> So here is a crazy idea that may be worth exploring:
>
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
>
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only to serialize list operations?
>
> I.e. in compaction, you'd do
>
> if (!TestClearPageLRU(page))
> goto isolate_fail;
> /*
> * We isolated the page's LRU state and thereby locked out all
> * other isolators, including cgroup page moving, page reclaim,
> * page freeing etc. That means page->mem_cgroup is now stable
> * and we can safely look up the correct lruvec and take the
> * page off its physical LRU list.
> */
> lruvec = mem_cgroup_page_lruvec(page);
> spin_lock_irq(&lruvec->lru_lock);
> del_page_from_lru_list(page, lruvec, page_lru(page));
>
> Putback would mostly remain the same (although you could take the
> PageLRU setting out of the list update locked section, as long as it's
> set after the page is physically linked):
>
> /* LRU isolation pins page->mem_cgroup */
> lruvec = mem_cgroup_page_lruvec(page)
> spin_lock_irq(&lruvec->lru_lock);
> add_page_to_lru_list(...);
> spin_unlock_irq(&lruvec->lru_lock);
>
> SetPageLRU(page);
>
> And you'd have to carefully review and rework other sites that rely on
> PageLRU: reclaim, __page_cache_release(), __activate_page() etc.
>
> Especially things like activate_page(), which used to only check
> PageLRU before shuffling the page on the LRU list, would now have to
> briefly clear PageLRU and then set it again afterwards.
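
(If I understand the idea right, __activate_page() would become roughly
the sketch below. This is only to show the TestClearPageLRU ordering,
not a patch: the real function is called through pagevec_lru_move_fn()
and gets the lruvec passed in, and the per-lruvec lru_lock is the one
from this series, so the signature here is a simplified assumption.)

static void __activate_page(struct page *page)
{
	struct lruvec *lruvec;
	int lru;

	if (PageActive(page) || PageUnevictable(page))
		return;

	/* Lock out reclaim, compaction and memcg page moving. */
	if (!TestClearPageLRU(page))
		return;

	/* PageLRU is clear, so page->mem_cgroup is stable from here. */
	lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
	lru = page_lru_base_type(page);

	spin_lock_irq(&lruvec->lru_lock);
	del_page_from_lru_list(page, lruvec, lru);
	SetPageActive(page);
	add_page_to_lru_list(page, lruvec, lru + LRU_ACTIVE);
	spin_unlock_irq(&lruvec->lru_lock);

	/* Make the page visible to other LRU users again. */
	SetPageLRU(page);
}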
>
> However, aside from a bit more churn in those cases, and the
> unfortunate additional atomic operations, I currently can't think of a
> fundamental reason why this wouldn't work.
>
> Hugh, what do you think?
>
Hi Johannes,

As to the TestClearPageLRU idea, the scenario we expect is:

    compaction                          commit_charge
                                        if (TestClearPageLRU)
        !TestClearPageLRU                   lock_page_lruvec
            goto isolate_fail;              del_from_lru_list
                                            unlock_page_lruvec

But there is a difficult situation to handle:

    compaction                          commit_charge
        TestClearPageLRU
                                        !TestClearPageLRU

When that test fails, the page could be in any of these states:
a) being reclaimed, b) moving between LRU lists, c) being migrated,
as in compaction, d) being mlocked, e) under split_huge_page.

So if commit_charge, even with lrucare, finds the page's LRU bit already
cleared, we still have no idea whether the page was isolated for one of
the reasons a~e or was simply never on an LRU at all, and handling each
of those cases differently would be costly.

So, for the issue you mentioned above, maybe the better idea is to use
lrucare when doing mem_cgroup_commit_charge(), since charging is not a
hot path; see the annotated diagram and rough sketch below. What do you
think of this solution?
Thanks
Alex
> compaction:                             generic_file_buffered_read:
>
>                                         page_cache_alloc()
>
> !PageBuddy()
>
> lock_page_lruvec(page)
>   lruvec = mem_cgroup_page_lruvec()
>   spin_lock(&lruvec->lru_lock)
>   if lruvec != mem_cgroup_page_lruvec()
>     goto again
>
>                                         add_to_page_cache_lru()
>                                           mem_cgroup_commit_charge()
                                              // do charge with lrucare
                                              spin_lock(&lruvec->lru_lock)
>                                             page->mem_cgroup = foo
>                                           lru_cache_add()
>                                             __pagevec_lru_add()
>                                               SetPageLRU()
>
> if PageLRU(page):
>   __isolate_lru_page()
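
In code, the idea would be something like the rough sketch below. This
is only to show the locking order, not a real patch: it reuses the
current commit_charge() helpers, the per-lruvec lru_lock comes from this
series, and the exact lookup calls are my assumption.

static void commit_charge(struct page *page, struct mem_cgroup *memcg,
			  bool lrucare)
{
	struct lruvec *lruvec;
	int isolated = 0;

	VM_BUG_ON_PAGE(page->mem_cgroup, page);

	if (lrucare) {
		/* Lock the lruvec the page belongs to right now. */
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		if (PageLRU(page)) {
			ClearPageLRU(page);
			del_page_from_lru_list(page, lruvec, page_lru(page));
			isolated = 1;
		}
		spin_unlock_irq(&lruvec->lru_lock);
	}

	/*
	 * Switch the page to the new memcg. The compaction side
	 * re-checks the lruvec after taking its lock (see the diagram
	 * above); that re-check is what this is meant to synchronize
	 * with.
	 */
	page->mem_cgroup = memcg;

	if (lrucare && isolated) {
		/* Put the page back, now on the new memcg's lruvec. */
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		SetPageLRU(page);
		add_page_to_lru_list(page, lruvec, page_lru(page));
		spin_unlock_irq(&lruvec->lru_lock);
	}
}

For a page that is not yet charged, the lookup should give the root
lruvec, i.e. the same lock the compaction side in the diagram takes, if
I read it right.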