linux-mm.kvack.org archive mirror
From: Daniel Phillips <phillips@arcor.de>
To: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@osdl.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hugh@veritas.com
Subject: Re: [RFC][patch 0/2] mm: remove PageReserved
Date: Sat, 13 Aug 2005 05:34:53 +1000
Message-ID: <200508130534.54155.phillips@arcor.de>
In-Reply-To: <3521.1123757360@warthog.cambridge.redhat.com>

On Thursday 11 August 2005 20:49, David Howells wrote:
> Daniel Phillips <phillips@arcor.de> wrote:
> > To be honest I'm having some trouble following this through logically. 
> > I'll read through a few more times and see if that fixes the problem. 
> > This seems cluster-related, so I have an interest.
>
> Well, perhaps I can explain the function for which I'm using this page flag
> more clearly. You'll have to excuse me if it's covering stuff you don't
> know, but I want to take it from first principles; plus this stuff might
> well find its way into the kernel docs.
>
>
> We want to use a relatively fast medium (such as RAM or local disk) to
> speed up repeated accesses to a relatively slow medium (such as NFS, NBD,
> CDROM) by means of caching the results of previous accesses to the slow
> medium on the fast medium.
>
> Now we already do this at one level: RAM. The page cache _is_ such a cache,
> but whilst it's much faster than a disk, it is severely restricted in size

Did you just suggest that 16 TB/address_space is too small to cache NFS pages?
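
For reference, the 16 TB figure can be reproduced with simple arithmetic. The sketch below assumes a 32-bit page index and 4 KiB pages; both values are assumptions about a typical 32-bit configuration, not something stated in this thread, and the function name is purely illustrative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest byte range a single address_space can index, assuming the
 * page index is an unsigned integer of index_bits bits and each page
 * holds (1 << page_shift) bytes. Both parameters are illustrative.
 */
static uint64_t address_space_limit(unsigned index_bits, unsigned page_shift)
{
    /* The result must fit in a 64-bit integer. */
    assert(index_bits + page_shift < 64);
    return (uint64_t)1 << (index_bits + page_shift);
}
```

With those assumptions, address_space_limit(32, 12) is 2^44 bytes, that is, 16 TiB per address_space.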

> compared to media such as disks, it's more expensive

It is?

> and its contents generally don't last over power failure or reboots.

When used by RAMFS, maybe.  But fortunately the page cache has a backing store 
API; in fact, that is its raison d'etre.

> The major attribute of the page cache is that the CPU can access it
> directly. 

You seem to have forgotten about non-resident pages.

> So we want to add another level: local disk. The FS-Cache/CacheFS patches
> permit filesystems such as AFS and NFS to use local disk as a cache.

The page cache already lets you do that.  I have not yet discerned a 
fundamental reason why you need to interface to another filesystem to 
implement backing store for an address_space.

> So, assume that NFS is using a local disk cache (it doesn't matter whether
> it's CacheFS, CacheFiles, or something else), and assume a process has a
> file open through NFS.
>
> The process attempts to read from the file. This causes the NFS readpage()
> or readpages() operation to be invoked to load the data into the page cache
> so that the CPU can make use of it.
>
> So the NFS page reading algorithm first consults the disk cache.  Assume 
> this returns a negative response - NFS will then read from the server into
> the page cache. Under cacheless operation, it would then unlock the page
> and the kernel could then let userspace play with it, but we're dealing
> with a cache, and so the newly fetched data must be stored in the disk
> cache for future retrieval.
>
> NFS now has three choices:
>
>  (1) It could instigate a write to the disk cache and wait for that to
>      complete before unlocking the page and letting userspace see it, but
>      we don't know how long that might take.

Pages are typically unlocked while being written to backing store, e.g.:

http://lxr.linux.no/source/fs/buffer.c#L1839

What makes NFS special in this regard?

>      CacheFS immediately dispatches a write BIO to get it DMA'd to the disk
>      as soon as possible, but something like CacheFiles is dependent on an
>      underlying filesystem - be it EXT3, ReiserFS, XFS, etc. - to perform the
>      write, and we've no control over that.

That is a problem you are in the process of inventing.

> 	Time to unlock: CacheMiss + NetRead + CacheWrite
> 	Cache reliable: Yes
>
>  (2) It could just unlock the page and let userspace scribble on it whilst
>      simultaneously writing it to the cache. But that means the DMA to the
>      disk may pick up some of userspace's scribblings, and that means you
>      can't trust what's in the cache in the event of a power loss.

I thought I saw a journal in there.  Anyway, if the user has asked for a racy 
write, that is what they should get.

>      This can be alleviated by marking untrustworthy files in the cache,
>      but that then extends the management time in several ways.
>
> 	Time to unlock: CacheMiss + NetRead
> 	Cache reliable: No

I think your definition of trustworthy goes beyond what is required by POSIX 
or Linux local filesystem semantics.

>  (3) It could tell the cache that the page needs writing to disk and then
>      unlock it for userspace to read, but intercept the change of a PTE
>      pointing to this page when it loses its write protection (PTEs start
>      off read-only, generating a write protection fault on the first write).

We need to do something like this to implement cross-node caching of 
shared-writeable mmaps.  This is another reason that your ideas need clear 
explanations: we need to go the rest of the way and get this sorted out for 
cluster filesystems in general, not just NFS (v4).  It does help a lot that 
you are attempting to explain what the needs of NFS actually are.  
Unfortunately, it seems you are proposing that this mechanism is essential 
even for single-node use, which is far from clear.

>      The interceptor would then force userspace to wait for the cache to
>      finish DMA'ing the page before writing to it.
>
>      Similarly, the write() or prepare_write() operations would wait for
>      the cache to finish with that page.

Here you return to the assumption that the VFS should enforce per-page write 
granularity.  There is no such rule as far as I know.

> 	Time to unlock: CacheMiss + NetRead
> 	Cache reliable: Yes
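
The "Time to unlock" and "Cache reliable" lines quoted for the three options can be put side by side in a small model. This is only a sketch of the accounting in the quoted text: the enum and struct names are hypothetical, the latencies are arbitrary units, and nothing here measures a real cache.

```c
#include <assert.h>

/* The three policies from the quoted list; the names are illustrative. */
enum unlock_policy {
    WAIT_FOR_CACHE_WRITE = 1,  /* option (1): unlock after the cache write */
    UNLOCK_AND_RACE      = 2,  /* option (2): unlock at once, allow races  */
    UNLOCK_AND_INTERCEPT = 3   /* option (3): unlock at once, trap writes  */
};

/* Hypothetical latencies in arbitrary units. */
struct latencies {
    unsigned cache_miss;
    unsigned net_read;
    unsigned cache_write;
};

/* Time until the page is unlocked and readable by userspace. */
static unsigned time_to_unlock(enum unlock_policy p, struct latencies l)
{
    assert(p >= WAIT_FOR_CACHE_WRITE && p <= UNLOCK_AND_INTERCEPT);
    unsigned t = l.cache_miss + l.net_read;
    if (p == WAIT_FOR_CACHE_WRITE)
        t += l.cache_write;   /* only option (1) waits for the cache write */
    return t;
}

/* Whether the cache contents can be trusted after a power loss. */
static int cache_reliable(enum unlock_policy p)
{
    return p != UNLOCK_AND_RACE;
}
```

In this model options (2) and (3) unlock at the same time, and only option (2) gives up reliability, which is the trade-off the quoted text describes.
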
>
> I originally chose option (1), but then I saw just how much it affected
> performance and worked on option (3).
>
> I discarded option (2) because I want to be able to have some surety about
> the state in the cache - I don't want to have to reinitialise it after a
> power failure. Imagine if you cache /usr... Imagine if everyone in a very
> large office caches /usr...
>
>
> So, the way I implemented (3) is to use an extra page flag to indicate a
> write underway to the cache, and thus allow cache write status to be
> determined when someone wants to scribble on a page.
>
> The fscache_write_page() function takes a pointer to a callback function.
> In NFS this function clears the PG_fs_misc bit on the appropriate pages and
> wakes up anyone who was waiting for this event (end_page_fs_misc()).
>
> The NFS page_mkwrite() VMA op calls wait_on_page_fs_misc() to wait on that
> page bit if it is set.
>
> > Who is using this interface?
>
> AFS and NFS will both use it. There may be others eventually who use it for
> the same purpose. CacheFS has a different use for it internally.
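
The flag-and-wait scheme described in the quoted text can be modeled in user space. The sketch below is not the kernel implementation: it assumes a mutex and condition variable stand in for the page wait queue, and every model_* name is a hypothetical stand-in for the PG_fs_misc, end_page_fs_misc() and wait_on_page_fs_misc() names mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* User-space stand-in for a page with a "cache write in flight" bit. */
struct model_page {
    pthread_mutex_t lock;
    pthread_cond_t  waitq;
    bool            fs_misc;   /* models the PG_fs_misc bit */
};

static void model_page_init(struct model_page *pg)
{
    pthread_mutex_init(&pg->lock, NULL);
    pthread_cond_init(&pg->waitq, NULL);
    pg->fs_misc = false;
}

/* Mark the page when a write to the cache is submitted. */
static void model_set_page_fs_misc(struct model_page *pg)
{
    pthread_mutex_lock(&pg->lock);
    pg->fs_misc = true;
    pthread_mutex_unlock(&pg->lock);
}

/* Completion callback: clear the bit and wake every waiter. */
static void model_end_page_fs_misc(struct model_page *pg)
{
    pthread_mutex_lock(&pg->lock);
    assert(pg->fs_misc);       /* a write must have been in flight */
    pg->fs_misc = false;
    pthread_cond_broadcast(&pg->waitq);
    pthread_mutex_unlock(&pg->lock);
}

/* Would-be writer: block until the cache has finished with the page. */
static void model_wait_on_page_fs_misc(struct model_page *pg)
{
    pthread_mutex_lock(&pg->lock);
    while (pg->fs_misc)
        pthread_cond_wait(&pg->waitq, &pg->lock);
    pthread_mutex_unlock(&pg->lock);
}
```

A writer that calls model_wait_on_page_fs_misc() while the bit is clear returns immediately; if the bit is set, it sleeps until another thread runs model_end_page_fs_misc().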

Let's try to clear up the page write atomicity question, please.  It seems 
your argument depends on it.

Regards,

Daniel