linux-mm.kvack.org archive mirror
From: Nishanth Aravamudan <nacc@us.ibm.com>
To: Ken Chen <kenchen@google.com>
Cc: William Lee Irwin III <wli@holomorphy.com>,
	linux-mm@kvack.org, agl@us.ibm.com, dwg@au1.ibm.com
Subject: Re: FADV_DONTNEED on hugetlbfs files broken
Date: Sun, 18 Mar 2007 10:27:11 -0700
Message-ID: <20070318172711.GA12978@us.ibm.com>
In-Reply-To: <b040c32a0703180043t29c675bfr9a9554575a261f96@mail.gmail.com>

On 18.03.2007 [00:43:01 -0700], Ken Chen wrote:
> On 3/17/07, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> >Yes, that could be :) Sorry if my e-mail indicated I was asking
> >otherwise. I don't want Ken's commit to be reverted, as that would
> >make hugepages very nearly unusable on x86 and x86_64. But I had
> >found a functional change and wanted it to be documented. If
> >hugepages can no longer be dropped from the page cache, then we
> >should make sure that is clear (and expected/desired).
> 
> Oh gosh, I think you are really abusing the buggy hugetlb behavior in
> the dark age of 2.6.19.  Hugetlb file does not have disk based backing
> store.  The in-core page that resides in the page cache is the only
> copy of the file.  For pages that are dirty, there is no place to
> sync them to and thus they have to stay in the page cache for the life
> of the file.

And 2.6.20, fwiw. Your explanation makes sense. Frustrating, though,
since it means segment remapping uses twice as many huge pages as it
needs to for each writable segment.

> And currently, there is no way to allocate hugetlb page in "clean"
> state because we can't mmap hugetlb page onto a disk file.  So pages
> for live file in hugetlbfs are always being written to initially and
> it is just not possible to drop them out of page cache, otherwise we
> suffer from data corruption.

Let's be clear, for the sake of the archives of the world, this is only
for *writable* allocations. In make_huge_pte():

        if (writable) {
                entry =
                    pte_mkwrite(pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
        } else {
                entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
        }

Probably obvious to anyone, since you need to be able to dirty the page to have
it in a dirty state.
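
For the archives, the branch above can be modelled in user space. This is only a toy sketch: the toy_* names and the two flag bits are made up for illustration and are not the kernel's pte helpers or any architecture's real PTE layout. It shows just the point made above, that a writable huge PTE is created already dirty, while a read-only one is created write-protected and clean.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for a toy PTE; real layouts are arch-specific. */
#define TOY_PTE_WRITE (1u << 0)
#define TOY_PTE_DIRTY (1u << 1)

typedef uint32_t toy_pte_t;

/*
 * Toy counterpart of the branch in make_huge_pte(): a writable mapping
 * gets both the write and dirty bits at creation time, a read-only
 * mapping gets neither.
 */
static toy_pte_t toy_make_huge_pte(bool writable)
{
        toy_pte_t entry = 0;

        if (writable)
                entry |= TOY_PTE_WRITE | TOY_PTE_DIRTY;
        return entry;
}

/* Accessors for the two toy bits. */
static bool toy_pte_write(toy_pte_t pte)
{
        return (pte & TOY_PTE_WRITE) != 0;
}

static bool toy_pte_dirty(toy_pte_t pte)
{
        return (pte & TOY_PTE_DIRTY) != 0;
}
```

In this model there is no way to get a writable entry that is also clean, which is the situation Ken describes for live hugetlbfs files.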

> >Now, even if I call fsync() on the file descriptor, I still don't get
> >the pages out of the page cache. It seems to me like fsync() would
> >clear the dirty state -- although perhaps with Ken's patch, writable
> >hugetlbfs pages will *always* be dirty? I'm still trying to figure
> >out what ever clears that dirty state (in hugetlbfs or anywhere
> >else). Seems like hugetlbfs truncates call cancel_dirty_page(), but
> >the comment there indicates it's only for truncates.
> 
> fsync can not drop dirty pages out of page cache because there is no
> backing store.  I believe truncate is the only way to remove hugetlb
> page out of page cache.

Which won't work here, because we don't want to lose the data. We just
want to drop the original MAP_SHARED copy of the file out of the page
cache. I tried ftruncate()'ing the file down to 0 after we've mapped it
MAP_PRIVATE and COW'd each hugepage, but then the process (obviously)
SEGVs. We lose all hugepages in the page cache.
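
For reference, the user-space side of what was being attempted looks roughly like the sketch below. The helper name sync_and_advise_dontneed() is hypothetical; fsync() and posix_fadvise() are the standard POSIX calls. A return of 0 means only that the advice was accepted. It does not mean any page left the page cache, and per this thread it will not for live hugetlbfs files.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for posix_fadvise() and fsync() */
#endif

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush the file, then hint that its cached pages
 * are no longer needed. Returns 0 if both calls succeed, otherwise a
 * positive errno-style value.
 */
int sync_and_advise_dontneed(int fd)
{
        /* fsync() reports failure through errno. */
        if (fsync(fd) != 0)
                return errno;

        /*
         * offset 0 with len 0 covers the whole file. posix_fadvise()
         * returns the error number directly instead of setting errno.
         */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

The advice is purely a hint, so a caller cannot tell from the return value whether anything was dropped.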

> >> Perhaps we should ask what ramfs, tmpfs, et al would do. Or, for
> >> that matter, if they suffer from the same issue as Ken Chen
> >> identified for hugetlbfs. Perhaps the issue is not hugetlb's dirty
> >> state, but drop_pagecache_sb() failing to check the bdi for
> >> BDI_CAP_NO_WRITEBACK.  Or perhaps what safety guarantees
> >> drop_pagecache_sb() is supposed to have or lack.
> 
> I looked, ramfs and tmpfs do the same thing.  fadvise(DONTNEED)
> doesn't do anything to live files.

Ok, thanks for looking into it, Ken.

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

