Subject: Re: [PATCH v16 16/22] mm/mlock: reorder isolation sequence during munlock
To: Alexander Duyck
Cc: Andrew Morton, Mel Gorman, Tejun Heo, Hugh Dickins, Konstantin Khlebnikov,
 Daniel Jordan, Yang Shi, Matthew Wilcox, Johannes Weiner, kbuild test robot,
 linux-mm, LKML, cgroups@vger.kernel.org, Shakeel Butt, Joonsoo Kim, Wei Yang,
 "Kirill A. Shutemov"
Shutemov" References: <1594429136-20002-1-git-send-email-alex.shi@linux.alibaba.com> <1594429136-20002-17-git-send-email-alex.shi@linux.alibaba.com> From: Alex Shi Message-ID: <6e37ee32-c6c5-fcc5-3cad-74f7ae41fb67@linux.alibaba.com> Date: Sun, 19 Jul 2020 11:55:45 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:68.0) Gecko/20100101 Thunderbird/68.7.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 X-Rspamd-Queue-Id: 063D817EE7 X-Spamd-Result: default: False [0.00 / 100.00] X-Rspamd-Server: rspam01 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: =E5=9C=A8 2020/7/18 =E4=B8=8A=E5=8D=884:30, Alexander Duyck =E5=86=99=E9=81= =93: > On Fri, Jul 10, 2020 at 5:59 PM Alex Shi w= rote: >> >> This patch reorder the isolation steps during munlock, move the lru lo= ck >> to guard each pages, unfold __munlock_isolate_lru_page func, to do the >> preparation for lru lock change. >> >> __split_huge_page_refcount doesn't exist, but we still have to guard >> PageMlocked and PageLRU for tail page in __split_huge_page_tail. >> >> [lkp@intel.com: found a sleeping function bug ... at mm/rmap.c] >> Signed-off-by: Alex Shi >> Cc: Kirill A. Shutemov >> Cc: Andrew Morton >> Cc: Johannes Weiner >> Cc: Matthew Wilcox >> Cc: Hugh Dickins >> Cc: linux-mm@kvack.org >> Cc: linux-kernel@vger.kernel.org >> --- >> mm/mlock.c | 93 ++++++++++++++++++++++++++++++++++-------------------= --------- >> 1 file changed, 51 insertions(+), 42 deletions(-) >> >> diff --git a/mm/mlock.c b/mm/mlock.c >> index 228ba5a8e0a5..0bdde88b4438 100644 >> --- a/mm/mlock.c >> +++ b/mm/mlock.c >> @@ -103,25 +103,6 @@ void mlock_vma_page(struct page *page) >> } >> >> /* >> - * Isolate a page from LRU with optional get_page() pin. >> - * Assumes lru_lock already held and page already pinned. >> - */ >> -static bool __munlock_isolate_lru_page(struct page *page, bool getpag= e) >> -{ >> - if (TestClearPageLRU(page)) { >> - struct lruvec *lruvec; >> - >> - lruvec =3D mem_cgroup_page_lruvec(page, page_pgdat(pag= e)); >> - if (getpage) >> - get_page(page); >> - del_page_from_lru_list(page, lruvec, page_lru(page)); >> - return true; >> - } >> - >> - return false; >> -} >> - >> -/* >> * Finish munlock after successful page isolation >> * >> * Page must be locked. This is a wrapper for try_to_munlock() >> @@ -181,6 +162,7 @@ static void __munlock_isolation_failed(struct page= *page) >> unsigned int munlock_vma_page(struct page *page) >> { >> int nr_pages; >> + bool clearlru =3D false; >> pg_data_t *pgdat =3D page_pgdat(page); >> >> /* For try_to_munlock() and to serialize with page migration *= / >> @@ -189,32 +171,42 @@ unsigned int munlock_vma_page(struct page *page) >> VM_BUG_ON_PAGE(PageTail(page), page); >> >> /* >> - * Serialize with any parallel __split_huge_page_refcount() wh= ich >> + * Serialize split tail pages in __split_huge_page_tail() whic= h >> * might otherwise copy PageMlocked to part of the tail pages = before >> * we clear it in the head page. It also stabilizes hpage_nr_p= ages(). >> */ >> + get_page(page); >=20 > I don't think this get_page() call needs to be up here. It could be > left down before we delete the page from the LRU list as it is really > needed to take a reference on the page before we call > __munlock_isolated_page(), or at least that is the way it looks to me. 
>
>> +	clearlru = TestClearPageLRU(page);
>
> I'm not sure I fully understand the reason for moving this here. By
> clearing this flag before you clear Mlocked does this give you some
> sort of extra protection? I don't see how since Mlocked doesn't
> necessarily imply the page is on LRU.
>

The comment above gives the reason for the lru_lock usage:

>> +	 * Serialize split tail pages in __split_huge_page_tail() which
>>  	 * might otherwise copy PageMlocked to part of the tail pages before
>>  	 * we clear it in the head page. It also stabilizes hpage_nr_pages().

Looking into __split_huge_page_tail(), there is a tiny gap between the
tail page getting PG_mlocked and it being added into the lru list.
TestClearPageLRU here also blocks memcg changes of the page by stopping
isolate_lru_page().
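To make the window concrete, the ordering in the split path is roughly
the following (heavily simplified sketch, not the exact code):

	/* __split_huge_page_tail(), simplified */
	page_tail->flags |= (head->flags & ((1L << PG_mlocked) | ...));
					/* tail may now show PageMlocked */
	...
	lru_add_page_tail(head, page_tail, lruvec, list);
					/* only after this is the tail PageLRU */

So a munlock walker could see PageMlocked on a tail page before PageLRU
is set, which is why the LRU bit is tested and cleared before touching
the Mlocked bit here.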
>>  	spin_lock_irq(&pgdat->lru_lock);
>>
>>  	if (!TestClearPageMlocked(page)) {
>> -		/* Potentially, PTE-mapped THP: do not skip the rest PTEs */
>> -		nr_pages = 1;
>> -		goto unlock_out;
>> +		if (clearlru)
>> +			SetPageLRU(page);
>> +		/*
>> +		 * Potentially, PTE-mapped THP: do not skip the rest PTEs
>> +		 * Reuse lock as memory barrier for release_pages racing.
>> +		 */
>> +		spin_unlock_irq(&pgdat->lru_lock);
>> +		put_page(page);
>> +		return 0;
>>  	}
>>
>>  	nr_pages = hpage_nr_pages(page);
>>  	__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
>>
>> -	if (__munlock_isolate_lru_page(page, true)) {
>> +	if (clearlru) {
>> +		struct lruvec *lruvec;
>> +
>
> You could just place the get_page() call here.
>
>> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> +		del_page_from_lru_list(page, lruvec, page_lru(page));
>>  		spin_unlock_irq(&pgdat->lru_lock);
>>  		__munlock_isolated_page(page);
>> -		goto out;
>> +	} else {
>> +		spin_unlock_irq(&pgdat->lru_lock);
>> +		put_page(page);
>> +		__munlock_isolation_failed(page);
>
> If you move the get_page() as I suggested above there wouldn't be a
> need for the put_page(). It then becomes possible to simplify the code
> a bit by merging the unlock paths and doing an if/else with the
> __munlock functions like so:
>
> if (clearlru) {
>     ...
>     del_page_from_lru..
> }
>
> spin_unlock_irq()
>
> if (clearlru)
>     __munlock_isolated_page();
> else
>     __munlock_isolated_failed();
>
>>  	}
>> -	__munlock_isolation_failed(page);
>> -
>> -unlock_out:
>> -	spin_unlock_irq(&pgdat->lru_lock);
>>
>> -out:
>>  	return nr_pages - 1;
>>  }
>>
>> @@ -297,34 +289,51 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>>  	pagevec_init(&pvec_putback);
>>
>>  	/* Phase 1: page isolation */
>> -	spin_lock_irq(&zone->zone_pgdat->lru_lock);
>>  	for (i = 0; i < nr; i++) {
>>  		struct page *page = pvec->pages[i];
>> +		struct lruvec *lruvec;
>> +		bool clearlru;
>>
>> -		if (TestClearPageMlocked(page)) {
>> -			/*
>> -			 * We already have pin from follow_page_mask()
>> -			 * so we can spare the get_page() here.
>> -			 */
>> -			if (__munlock_isolate_lru_page(page, false))
>> -				continue;
>> -			else
>> -				__munlock_isolation_failed(page);
>> -		} else {
>> +		clearlru = TestClearPageLRU(page);
>> +		spin_lock_irq(&zone->zone_pgdat->lru_lock);
>
> I still don't see what you are gaining by moving the bit test up to
> this point. Seems like it would be better left below with the lock
> just being used to prevent a possible race while you are pulling the
> page out of the LRU list.
>

The same reason as the comments above mentioned: the
__split_huge_page_tail() issue.

>> +
>> +		if (!TestClearPageMlocked(page)) {
>>  			delta_munlocked++;
>> +			if (clearlru)
>> +				SetPageLRU(page);
>> +			goto putback;
>> +		}
>> +
>> +		if (!clearlru) {
>> +			__munlock_isolation_failed(page);
>> +			goto putback;
>>  		}
>
> With the other function you were processing this outside of the lock,
> here you are doing it inside. It would probably make more sense here
> to follow similar logic and take care of the del_page_from_lru_list
> if clearlru is set, unlock, and then if clearlru is set continue else
> track the isolation failure. That way you can avoid having to use as
> many jump labels.
>
>>  		/*
>> +		 * Isolate this page.
>> +		 * We already have pin from follow_page_mask()
>> +		 * so we can spare the get_page() here.
>> +		 */
>> +		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
>> +		del_page_from_lru_list(page, lruvec, page_lru(page));
>> +		spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>> +		continue;
>> +
>> +		/*
>>  		 * We won't be munlocking this page in the next phase
>>  		 * but we still need to release the follow_page_mask()
>>  		 * pin. We cannot do it under lru_lock however. If it's
>>  		 * the last pin, __page_cache_release() would deadlock.
>>  		 */
>> +putback:
>> +		spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>>  		pagevec_add(&pvec_putback, pvec->pages[i]);
>>  		pvec->pages[i] = NULL;
>>  	}
>> +	/* tempary disable irq, will remove later */
>> +	local_irq_disable();
>>  	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
>> -	spin_unlock_irq(&zone->zone_pgdat->lru_lock);
>> +	local_irq_enable();
>>
>>  	/* Now we can release pins of pages that we are not munlocking */
>>  	pagevec_release(&pvec_putback);
>> --
>> 1.8.3.1
>>
>>
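For the "fewer jump labels" idea, the loop body could look something like
the following (just a sketch on top of this patch, untested; it reuses the
locals already in the patch plus a new "mlocked" flag):

		clearlru = TestClearPageLRU(page);
		spin_lock_irq(&zone->zone_pgdat->lru_lock);

		mlocked = TestClearPageMlocked(page);
		if (!mlocked) {
			delta_munlocked++;
			if (clearlru)
				SetPageLRU(page);
		} else if (clearlru) {
			/* the follow_page_mask() pin spares a get_page() here */
			lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
			del_page_from_lru_list(page, lruvec, page_lru(page));
		}

		spin_unlock_irq(&zone->zone_pgdat->lru_lock);

		if (mlocked && clearlru)
			continue;	/* isolated; munlock it in phase 2 */

		if (mlocked)
			__munlock_isolation_failed(page);

		/* not munlocking: release the follow_page_mask() pin later */
		pagevec_add(&pvec_putback, pvec->pages[i]);
		pvec->pages[i] = NULL;

That keeps the isolation-failure accounting outside the lock and drops
both goto labels, while the behavior should stay the same as the patch.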