linux-mm.kvack.org archive mirror
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Michal Hocko <mhocko@kernel.org>,
	"Andrea Arcangeli" <aarcange@redhat.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: [PATCH] huegtlbfs: fix page leak during migration of file pages
Date: Fri, 8 Feb 2019 02:31:32 +0000	[thread overview]
Message-ID: <20190208023132.GA25778@hori1.linux.bs1.fc.nec.co.jp> (raw)
In-Reply-To: <917e7673-051b-e475-8711-ed012cff4c44@oracle.com>

On Thu, Feb 07, 2019 at 10:50:55AM -0800, Mike Kravetz wrote:
> On 1/30/19 1:14 PM, Mike Kravetz wrote:
> > Files can be created and mapped in an explicitly mounted hugetlbfs
> > filesystem.  If pages in such files are migrated, the filesystem
> > usage will not be decremented for the associated pages.  This can
> > result in mmap or page allocation failures as it appears there are
> > fewer pages in the filesystem than there should be.
> 
> Does anyone have a little time to take a look at this?
> 
> While migration of hugetlb pages 'should' not be a common issue, we
> have seen it happen via soft memory errors/page poisoning in production
> environments.  We didn't see a leak in that case, as the pages were in
> a Sys V shared memory segment.  However, our DB code is starting to make
> use of files in explicitly mounted hugetlbfs filesystems.  Therefore, we
> are more likely to hit this bug in the field.

Hi Mike,

Thank you for finding/reporting the problem.
# sorry for my late response.

> 
> > 
> > For example, a test program which hole punches, faults and migrates
> > pages in such a file (1G in size) will eventually fail because it
> > can not allocate a page.  Reported counts and usage at time of failure:
> > 
> > node0
> > 537	free_hugepages
> > 1024	nr_hugepages
> > 0	surplus_hugepages
> > node1
> > 1000	free_hugepages
> > 1024	nr_hugepages
> > 0	surplus_hugepages
> > 
> > Filesystem                         Size  Used Avail Use% Mounted on
> > nodev                              4.0G  4.0G     0 100% /var/opt/hugepool
> > 
> > Note that the filesystem shows 4G of pages used, while actual usage is
> > 511 pages (just under 1G).  Failed trying to allocate page 512.
> > 
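For concreteness, a reproducer along those lines might look roughly like the
sketch below (hypothetical, not the actual test program; it assumes a 2MB-page
hugetlbfs mount at /var/opt/hugepool, at least two NUMA nodes, and libnuma for
move_pages(2)).  Once the accounting leak exhausts the filesystem size, faults
on the punched range start failing:

/*
 * Minimal sketch of a punch/fault/migrate loop (hypothetical; not the
 * actual test program).  Assumes a 2MB-page hugetlbfs mount at
 * /var/opt/hugepool and at least two NUMA nodes.
 * Build with: gcc repro.c -lnuma
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>       /* FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE */
#include <numaif.h>             /* move_pages(2), MPOL_MF_MOVE */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE      (2UL << 20)     /* assumed 2MB huge pages */
#define FILE_SIZE       (1UL << 30)     /* 1G file, as in the report */

int main(void)
{
        int fd = open("/var/opt/hugepool/testfile", O_CREAT | O_RDWR, 0644);
        char *addr;
        unsigned long off;
        int pass;

        if (fd < 0 || ftruncate(fd, FILE_SIZE))
                return 1;
        addr = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                return 1;

        for (pass = 0; ; pass++) {
                /* Punch out every page, then fault them all back in. */
                fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          0, FILE_SIZE);
                for (off = 0; off < FILE_SIZE; off += HPAGE_SIZE)
                        addr[off] = 1;

                /* Migrate each huge page, alternating target node per pass. */
                for (off = 0; off < FILE_SIZE; off += HPAGE_SIZE) {
                        void *page = addr + off;
                        int node = pass & 1, status;

                        if (move_pages(0, 1, &page, &node, &status,
                                       MPOL_MF_MOVE))
                                perror("move_pages");
                }
        }
        return 0;
}
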
> > If a hugetlb page is associated with an explicitly mounted filesystem,
> > this information is contained in the page_private field.  At migration
> > time, this information is not preserved.  To fix, simply transfer
> > page_private from old to new page at migration time if necessary. Also,
> > migrate_page_states() unconditionally clears page_private and PagePrivate
> > of the old page.  It is unlikely, but possible that these fields could
> > be non-NULL and be needed at hugetlb free page time.  So, do not touch
> > these fields for hugetlb pages.
> > 
> > Cc: <stable@vger.kernel.org>
> > Fixes: 290408d4a250 ("hugetlb: hugepage migration core")
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  fs/hugetlbfs/inode.c | 10 ++++++++++
> >  mm/migrate.c         | 10 ++++++++--
> >  2 files changed, 18 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index 32920a10100e..fb6de1db8806 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -859,6 +859,16 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
> >  	rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> >  	if (rc != MIGRATEPAGE_SUCCESS)
> >  		return rc;
> > +
> > +	/*
> > +	 * page_private is subpool pointer in hugetlb pages, transfer
> > +	 * if needed.
> > +	 */
> > +	if (page_private(page) && !page_private(newpage)) {
> > +		set_page_private(newpage, page_private(page));
> > +		set_page_private(page, 0);

Don't you also need to transfer the PagePrivate flag?
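If it does need to move, the transfer would presumably look something like
the following sketch (illustrative only; whether hugetlb actually depends on
PagePrivate here is exactly the question):

        if (page_private(page) && !page_private(newpage)) {
                set_page_private(newpage, page_private(page));
                SetPagePrivate(newpage);        /* hypothetical: mirror the flag as well */
                set_page_private(page, 0);
                ClearPagePrivate(page);
        }
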

> > +	}
> > +
> >  	if (mode != MIGRATE_SYNC_NO_COPY)
> >  		migrate_page_copy(newpage, page);
> >  	else
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index f7e4bfdc13b7..0d9708803553 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -703,8 +703,14 @@ void migrate_page_states(struct page *newpage, struct page *page)
> >  	 */
> >  	if (PageSwapCache(page))
> >  		ClearPageSwapCache(page);
> > -	ClearPagePrivate(page);
> > -	set_page_private(page, 0);
> > +	/*
> > +	 * Unlikely, but PagePrivate and page_private could potentially
> > +	 * contain information needed at hugetlb free page time.
> > +	 */
> > +	if (!PageHuge(page)) {
> > +		ClearPagePrivate(page);
> > +		set_page_private(page, 0);
> > +	}

# This argument is mainly for existing code...

According to the comment on migrate_page():

    /*
     * Common logic to directly migrate a single LRU page suitable for
     * pages that do not use PagePrivate/PagePrivate2.
     *
     * Pages are locked upon entry and exit.
     */
    int migrate_page(struct address_space *mapping, ...

This common logic assumes that page_private is not used, so why does
migrate_page_states() explicitly clear page_private at all?
buffer_migrate_page(), which is commonly used when page_private is in
use, does that clearing outside migrate_page_states(), so I thought
hugetlbfs_migrate_page() could do the same.  IOW, migrate_page_states()
should not touch PagePrivate at all.  But there are a few other
.migratepage callbacks, and I'm not sure all of them are safe for such
a change, so this approach might not fit a small fix.
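
For illustration, that alternative would look roughly like the sketch below
(a sketch only, given the caveat above about other .migratepage callbacks):
migrate_page_states() would stop clearing page_private, and
hugetlbfs_migrate_page() would hand the subpool pointer over itself, much as
buffer_migrate_page() manages its buffers outside migrate_page_states():

        /*
         * Sketch of the alternative: the .migratepage callback owns the
         * page_private handoff, and migrate_page_states() leaves
         * PagePrivate/page_private alone entirely.
         */
        static int hugetlbfs_migrate_page(struct address_space *mapping,
                                          struct page *newpage, struct page *page,
                                          enum migrate_mode mode)
        {
                int rc;

                rc = migrate_huge_page_move_mapping(mapping, newpage, page);
                if (rc != MIGRATEPAGE_SUCCESS)
                        return rc;

                /* Transfer the hugetlb subpool pointer to the new page. */
                set_page_private(newpage, page_private(page));
                set_page_private(page, 0);

                if (mode != MIGRATE_SYNC_NO_COPY)
                        migrate_page_copy(newpage, page);
                else
                        migrate_page_states(newpage, page);

                return MIGRATEPAGE_SUCCESS;
        }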

# BTW, there seems to be a typo in $SUBJECT.

Thanks,
Naoya Horiguchi

Thread overview: 19+ messages
2019-01-30 21:14 Mike Kravetz
2019-01-31 14:12 ` Sasha Levin
2019-02-01 22:36   ` Mike Kravetz
2019-02-07 18:50 ` Mike Kravetz
2019-02-08  2:31   ` Naoya Horiguchi [this message]
2019-02-08  5:50     ` Mike Kravetz
2019-02-08  7:31       ` Naoya Horiguchi
2019-02-11 23:06         ` Mike Kravetz
2019-02-12  2:24           ` Naoya Horiguchi
2019-02-12  2:37             ` Mike Kravetz
2019-02-12 22:14               ` [PATCH] huegtlbfs: fix races and page leaks during migration Mike Kravetz
2019-02-14  1:32                 ` Mike Kravetz
2019-02-15 15:48                 ` Sasha Levin
2019-02-18 21:14                 ` Sasha Levin
2019-02-21  6:09                 ` Andrew Morton
2019-02-21 19:11                   ` Mike Kravetz
2019-02-21 19:47                     ` Andrew Morton
2019-02-26  7:44                     ` Naoya Horiguchi
2019-02-27  0:35                       ` Mike Kravetz
