From: Yang Shi <yang.shi@linux.alibaba.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
Li Xinhai <lixinhai.lxh@gmail.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: akpm <akpm@linux-foundation.org>, mhocko <mhocko@suse.com>,
n-horiguchi <n-horiguchi@ah.jp.nec.com>
Subject: Re: [PATCH 2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone
Date: Wed, 15 Jan 2020 13:30:58 -0800 [thread overview]
Message-ID: <9fabdc16-fc31-7e06-e306-af38850de771@linux.alibaba.com> (raw)
In-Reply-To: <c102d153-5093-94c9-7e6d-65fd6881831d@oracle.com>
[-- Attachment #1: Type: text/plain, Size: 5502 bytes --]
On 1/15/20 1:07 PM, Mike Kravetz wrote:
> Naoya, would appreciate any comments you have as this seems to be related
> to your changes to add huge page migration support.
>
> On 1/14/20 5:07 PM, Mike Kravetz wrote:
>> 1) Why does the man page say MPOL_MF_STRICT is ignored on huge page mappings?
>> - Is that leftover from the the days when huge page migration was not
>> supported?
>> - Is it just because huge page migration is more likely to fail than
>> base page migration.
>
> According to man page, mbind was added to v2.6.7 kernel. git history does
> not go back that far, so I first looked at v2.6.12.
>
> Definitions from include/linux/mempolicy.h
> ------------------------------------------
> /* Policies */
> #define MPOL_DEFAULT 0
> #define MPOL_PREFERRED 1
> #define MPOL_BIND 2
> #define MPOL_INTERLEAVE 3
>
> /* Flags for mbind */
> #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
>
> Note the MPOL_MF_STRICT would simply verify current page placement. At
> this time, hugetlb pages were not part of this verification.
>
> From v2.6.12 routine check_range()
> ...
> for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
> if (!vma->vm_next && vma->vm_end < end)
> return ERR_PTR(-EFAULT);
> if (prev && prev->vm_end < vma->vm_start)
> return ERR_PTR(-EFAULT);
> if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
> err = verify_pages(vma->vm_mm,
> vma->vm_start, vma->vm_end, nodes);
> if (err) {
> first = ERR_PTR(err);
> break;
> }
> }
> prev = vma;
> }
> ...
>
> man page history does not go back this far. The first mbind man page has
> some things that are interesting:
> 1) It DOES NOT have the note saying MPOL_MF_STRICT is ignored on huge
> page mappings (even though the code does this).
> 2) Support for huge page policy was added with 2.6.16
> 3) For MPOL_MF_STRICT, in 2.6.16 or later the kernel will also try to
> move pages to the requested node with this flag.
>
> Next look at v2.6.16 kernel
> ==========================
> This kernel introduces the MPOL_MF_MOVE* flags
> #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform
> to mapping */
> #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
>
> Once again, the up front check_range() routine will skip hugetlb vma's.
> Note that check_range() will also isolate pages. So since hugetlb vma's
> are skipped, no hugetlb pages will be isolated. Since no hugetlb pages
> are isolated, none will be on the list to be migrate. Therefore, hugetlb
> pages are effectively ignored for the new MPOL_MF_MOVE* flags. This makes
> sense as huge page migration support was not added until the v3.12 kernel.
>
> Note that at when MPOL_MF_MOVE* flags were added to the mbind man page,
> the statement "MPOL_MF_STRICT is ignored on huge page mappings right now."
> was added. This was later changed to "MPOL_MF_STRICT is ignored on huge
> page mappings."
>
> Next look at v3.12 kernel
> =========================
> The v3.12 code looks similiar to today's code. In the verify/isolation
> phase, v3.12 routine queue_pages_hugetlb_pmd_range() looks very similiar
> to queue_pages_hugetlb(). Neither will cause errors at this point in the
> process. And, hugetlb pages with a mapcount == 1 will potentially be
> isolated and added to the list of pages to be migrated. In addition, if
> the subsequent call to migrate_pages() fails to migrate ANY page, we
> return -EIO if MPOL_MF_STRICT is set. This is true even if the only page(s)
> that failed to migrate were hugetlb pages.
>
> It should also be noted that no mbind man page updates were made WRT
> hugetlb pages after migration support was added.
>
> Summary
> =======
> It 'looks' like the statement "MPOL_MF_STRICT is ignored on huge page
> mappings." is left over from the original mbind implementation. When
> the huge page migration support was added, I can not be sure if ignoring
> MPOL_MF_STRICT for huge pages during the verify/isolation phase was
> intentional. It seems like it was as the return value from
> isolate_huge_page() is ignored.
Thanks a lot for digging into the history.
>
> What should we do?
> ==================
> 1) Nothing more than optimizations by Li Xinhai. Behavior that could be
> seen as conflicting with man page has existed since v3.12 and I am
> not aware of any complaints.
> 2) In addition to optimizations by Li Xinhai, modify code to truly ignore
> MPOL_MF_STRICT for huge page mappings. This would be fairly easy to do
> after a failure of migrate_pages(). We could simply traverse the list
> of pages that were not migrated looking for any non-hugetlb page.
I don't think we can do this easily since migrate_pages() would put the
migration failed hugetlb pages back to hugepage_activelist so there
should not any hugetlb page reside on the pagelist regardless of failure
if I read the code correctly.
Other than that traversing page list to look for a certain type page
doesn't sound scalable to me.
> 3) Remove the statement "MPOL_MF_STRICT is ignored on huge page mappings."
> and modify code accordingly.
>
> My suggestion would be for 1 or 2. Thoughts?
By rethinking the history (thanks again for digging into it), it sounds
#3 should be more reasonable. It sounds like the behavior was changed
since hugetlb migration was added but the man page was not updated to
reflect the change.
[-- Attachment #2: Type: text/html, Size: 6455 bytes --]
next prev parent reply other threads:[~2020-01-15 21:31 UTC|newest]
Thread overview: 30+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-01-14 9:16 [PATCH 1/2] mm/mempolicy: Checking hstate for hugetlbfs page in vma_migratable Li Xinhai
2020-01-14 9:16 ` [PATCH 2/2] mm/mempolicy: Skip walking HUGETLB vma if MPOL_MF_STRICT is specified alone Li Xinhai
2020-01-14 14:09 ` Li Xinhai
2020-01-14 18:27 ` Yang Shi
2020-01-15 1:07 ` Mike Kravetz
2020-01-15 1:24 ` Yang Shi
2020-01-15 4:28 ` Mike Kravetz
2020-01-15 5:23 ` Yang Shi
2020-01-15 7:36 ` Li Xinhai
2020-01-15 17:16 ` Yang Shi
2020-01-15 21:07 ` Mike Kravetz
2020-01-15 21:30 ` Yang Shi [this message]
2020-01-15 21:45 ` Mike Kravetz
2020-01-15 21:59 ` Yang Shi
2020-01-16 8:07 ` HORIGUCHI NAOYA(堀口 直也)
2020-01-16 15:32 ` Li Xinhai
2020-01-16 7:59 ` Michal Hocko
2020-01-16 19:22 ` Mike Kravetz
2020-01-17 2:32 ` Yang Shi
2020-01-17 2:38 ` Li Xinhai
2020-01-17 7:57 ` Michal Hocko
2020-01-17 12:05 ` Li Xinhai
2020-01-17 15:20 ` Michal Hocko
2020-01-17 15:46 ` Li Xinhai
2020-01-20 12:45 ` Michal Hocko
2020-01-21 14:15 ` Li Xinhai
2020-01-21 14:53 ` Michal Hocko
2020-01-22 13:55 ` Li Xinhai
2020-01-14 19:12 ` [PATCH 1/2] mm/mempolicy: Checking hstate for hugetlbfs page in vma_migratable Mike Kravetz
2020-01-15 1:25 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9fabdc16-fc31-7e06-e306-af38850de771@linux.alibaba.com \
--to=yang.shi@linux.alibaba.com \
--cc=akpm@linux-foundation.org \
--cc=linux-mm@kvack.org \
--cc=lixinhai.lxh@gmail.com \
--cc=mhocko@suse.com \
--cc=mike.kravetz@oracle.com \
--cc=n-horiguchi@ah.jp.nec.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox