* A use case for MAP_COPY
@ 2017-01-05 6:37 George Spelvin
2017-01-05 18:59 ` Linus Torvalds
0 siblings, 1 reply; 6+ messages in thread
From: George Spelvin @ 2017-01-05 6:37 UTC (permalink / raw)
To: akpm, linux-mm, torvalds; +Cc: linux
Back in 2001, Linus had some very negative things to say about MAP_COPY.
I'm going to try to change that opinion.
> The thing with MAP_COPY is that how do you efficiently _detect_ somebody
> else's changes on a page that you haven't even read in yet?
>
> So you have a few choices, all bad:
>
> - immediately reading in everything, basically turning the mmap() into a
> read. Obviously a bad idea.
>
> - mark the inode as a "copy" inode, and whenever somebody writes to it,
> you not only make sure that you do copy-on-write on the page cache page
> (which, btw, is pretty much impossible - how did you intend to find all
> the other _non_COPY_ users that _want_ coherency).
>
> You also have to make sure that if somebody changes the page, you have
> to read in the old contents first (not normally needed for most
> changes that write over at least a full block), but you also have to
> save the old page somewhere so that the mapping can use it if it faults
> it in later. And how the hell do you do THAT? Especially as you can
> have multiple generations of inodes with different sets of "MAP_COPY"
> on different contents..
>
> In short, now you need filesystem versioning at a per-page level etc.
>
> Trust me. The people who came up with MAP_COPY were stupid. Really. It's
> an idiotic concept, and it's not worth implementing.
I think I have a semantic for MAP_COPY that is both efficiently
implementable and useful.
The meaning is "For each page in the mapping, a snapshot of the backing
file is taken at some undefined time between the mmap() call and the
first access to the mapped memory. The time of the snapshot may (will!)
be different for each page. Once taken, the snapshot will not be affected
by later writes to the file."
This does not solve any problems having to do with atomic update of files.
You still need to do the copy-and-rename dance to do an atomic update
larger than a single page.
What it *does* solve is time-of-check-to-time-of-use security problems
in the caller. Once I've checked the file for corruption, I can
rely on it staying uncorrupted.
Once I've checked it (parsed, validated, checksummed, whatever), I can
use data structures in the mapped file directly in internal code without
fear of a TOCTTOU race.
Without MAP_COPY, my choices are:
- Explicit copy each time, or
- Greatly expanding the amount of code that has to be robust against
TOCTTOU races.
The former is a waste of time and memory 99% of the time, because the
input file *isn't* being changed.
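
A minimal sketch of the first option (explicit copy), written in Python for brevity. The names load_snapshot and demo and the file contents are made up for illustration. It shows only that an eager copy is immune to later writes, at the price of always paying for the copy.

```python
import os
import tempfile


def load_snapshot(path):
    """Read the whole file once; later writes cannot reach the returned bytes."""
    with open(path, "rb") as f:
        return f.read()


def demo():
    """Take a snapshot, modify the file, and return (snapshot, current contents)."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"key=value\n")
        snapshot = load_snapshot(path)   # eager copy: one set of bytes
        os.pwrite(fd, b"KEY", 0)         # a later writer changes the file
        return snapshot, load_snapshot(path)
    finally:
        os.close(fd)
        os.unlink(path)
```

The snapshot keeps the old bytes while a fresh read sees the new ones. Every caller pays for the copy, even when no writer ever shows up.
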
The goal of this is to provide the same sort of guarantees that
SHMEM_SET_SEALS does. The bytes we map may be arbitrarily corrupted
by malicious writers, but at least we only see *one* set of bytes.
We don't have to worry about them changing underneath us.
Now, implementation-wise, I hope it's obvious that the "undefined time
between the mmap call and first access" when the snapshot is taken
is when the page is faulted in. Which the kernel may do whenever it
damn well pleases.
The whole "what if it's not read in yet?" question goes away, because
no guarantees apply until it is.
Once a page is read in, the kernel may clone it at any time that's
convenient. Avoiding this is an efficiency goal, but it's the all-purpose
solution to awkward corner cases. In particular, that's what you do
if the page gets evicted. No support from file systems is required; if
the page is evicted, the file mapping is removed, and the page remains
as an anonymous page (copy achieved). A later eviction attempt will
then push it to the swap file.
Implementation isn't effortless; the COW operation is more complex than
for MAP_PRIVATE.
When a write happens, we don't just fork off a copy for the writing mm.
Rather, the mappings have to be divided into MAP_COPY users and others,
and one of those sets moved to a new page. We can either leave the
snapshot in place and move the named map, or we can make a copy and
avoid touching the file system cache. I haven't figured out which is
easier yet.
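
A toy model of the second variant (make a copy and leave the page cache page alone), written in Python. This is not kernel code. ToyFile, Page and the dict-based mappings are invented for illustration, every mapping is assumed to be faulted in at mmap time, and a single page stands in for the whole file.

```python
class Page:
    """One page worth of bytes."""
    def __init__(self, data):
        self.data = bytearray(data)


class ToyFile:
    """A single page-cache page plus the mappings that point at it."""
    def __init__(self, data):
        self.cache_page = Page(data)
        self.mappings = []

    def mmap(self, map_copy):
        # Assumption: the page is faulted in immediately, so the snapshot
        # time is the time of this call.
        mapping = {"page": self.cache_page, "map_copy": map_copy}
        self.mappings.append(mapping)
        return mapping

    def write(self, offset, data):
        # Divide the mappings: MAP_COPY users still sharing the cache page
        # must be moved away before the cache page changes.
        frozen = [m for m in self.mappings
                  if m["map_copy"] and m["page"] is self.cache_page]
        if frozen:
            snapshot = Page(self.cache_page.data)  # one copy for all of them
            for m in frozen:
                m["page"] = snapshot
        self.cache_page.data[offset:offset + len(data)] = data
```

The loop over frozen mappings is the expensive part. Its cost grows with the number of MAP_COPY mappings and is paid by whoever writes.
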
Still, it doesn't seem hopelessly impractical. And it seems useful.
What do other people think?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: A use case for MAP_COPY
2017-01-05 6:37 A use case for MAP_COPY George Spelvin
@ 2017-01-05 18:59 ` Linus Torvalds
2017-01-05 21:10 ` George Spelvin
0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2017-01-05 18:59 UTC (permalink / raw)
To: George Spelvin; +Cc: Andrew Morton, linux-mm
On Wed, Jan 4, 2017 at 10:37 PM, George Spelvin
<linux@sciencehorizons.net> wrote:
> Back in 2001, Linus had some very negative things to say about MAP_COPY.
> I'm going to try to change that opinion.
Not going to happen.
Basically, the way you can change that opinion is if you can show some
clever zero-cost versioning model that "just works". With an actual
patch.
Because I'm not seeing it.
And without it being zero cost to all the _real_ users, I'm not adding
a MAP_COPY that absolutely nobody will ever use because it's not
standard, and it's not useful enough to them.
We've had a history of failed clever interfaces that end up being very
painful to maintain (splice() being the most obvious one, but we've
had a number of filesystem innovations that just didn't work either,
devfs being the most spectacularly bad one).
> I think I have a semantic for MAP_COPY that is both efficiently
> implementable and useful.
The semantic meaning is not my worry. The implementation is.
> The meaning is "For each page in the mapping, a snapshot of the backing
> file is taken at some undefined time between the mmap() call and the
> first access to the mapped memory. The time of the snapshot may (will!)
> be different for each page. Once taken, the snapshot will not be affected
> by later writes to the file."
Show me the efficient implementation.
I see the trivial part: at page fault time, just do a COW if the page
has any other users. But to know if it has "users", you now need
another count that distinguishes between plain other mappings or
*writable* mappings (so "mapcount" needs to be split up).
That part is fairly simple, because the "new writable mappings" is
hopefully just in a few places.
But the hard part is for all *other* users that might write to the
page now need to do the cow for somebody else. So it basically
requires a per-page count (possibly just flag) of "this has a copy
mapping", along with everybody who might write to it that currently
just get a ref to the page to check it, and do the rmap thing etc.
And just creating those two new fields is a big problem. We literally
had a long discussion just about getting a single new _bit_ free'd up
in the page flags, because things are so tight. You need two new
fields entirely.
I'm not saying it's impossible. But it's a lot of details (and that
extra field to a very core data structure really is surprisingly
painful) for some very dubious gains. People simply won't be using it.
Linus
* Re: A use case for MAP_COPY
2017-01-05 18:59 ` Linus Torvalds
@ 2017-01-05 21:10 ` George Spelvin
2017-01-05 22:14 ` Kirill A. Shutemov
2017-01-05 22:49 ` Linus Torvalds
0 siblings, 2 replies; 6+ messages in thread
From: George Spelvin @ 2017-01-05 21:10 UTC (permalink / raw)
To: linux, torvalds; +Cc: akpm, kirill.shutemov, linux-mm, mgorman, riel
Linus Torvalds wrote:
> On Wed, Jan 4, 2017 at 10:37 PM, George Spelvin
> <linux@sciencehorizons.net> wrote:
>> Back in 2001, Linus had some very negative things to say about MAP_COPY.
>> I'm going to try to change that opinion.
> Not going to happen.
Really? Because the rest of your response is a lot more encouraging.
> Basically, the way you can change that opinion is if you can show some
> clever zero-cost versioning model that "just works". With an actual
> patch.
That's the response I was hoping for! That's a change from "it's a
stupid idea and crazily impractical" to "I seriously doubt it can be done
cheap enough."
> And without it being zero cost to all the _real_ users, I'm not adding
> a MAP_COPY that absolutely nobody will ever use because it's not
> standard, and it's not useful enough to them.
FWIW, I was writing some code and wishing for some semantics like this,
which is what led me to learn about MAP_COPY and all that.
I have a big config file full of strings, which I parse and index.
The vast majority of them contain no metacharacters, and I thought I
could just cache a (ptr, len) into the mapped config file, and save a
lot of allocation and copying. But someone could put a metacharacter
into the file after I parse it.
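
To make the hazard concrete, here is a minimal sketch in Python rather than C. It assumes Linux and a read-only shared mapping (mmap.ACCESS_READ). The file contents, the cached (offset, len) pair and the function name are made up for illustration, and the "someone" is simulated by a pwrite() from the same process.

```python
import mmap
import os
import tempfile


def cached_span_changes_after_validation():
    """Validate a mapped string, then let a writer change the file underneath it."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"name=plain_value\n")
        # ACCESS_READ gives a read-only mapping that stays coherent with the file.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as view:
            start, length = 5, 11                 # cached (offset, len) of "plain_value"
            checked = bytes(view[start:start + length])
            assert b"$" not in checked            # time of check: no metacharacters

            os.pwrite(fd, b"$", start)            # someone edits the file afterwards

            used = bytes(view[start:start + length])  # time of use
        return checked, used
    finally:
        os.close(fd)
        os.unlink(path)
```

The cached span passes the check and then reads back with a metacharacter in it, so every later consumer of that (ptr, len) would have to re-validate.
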
Would that constitute a security problem? Damn it, now I have to do a
much more complex analysis. Moan, bitch, grumble, whinge, "there ought
to be a way." And this idea popped out.
The thing is, TOCTTOU is a well-known security problem. We already have
custom interfaces in the kernel specifically to address this issue.
So it seemed possible that this might be of broader interest.
> We've had a history of failed clever interfaces that end up being very
> painful to maintain (splice() being the most obvious one, but we've
> had a number of filesystem innovations that just didn't work either,
> devfs being the most spectacularly bad one).
Absolutely. That's why I wanted to float the idea before I did a ton
of implementation work and got emotionally attached to the result.
> But the hard part is for all *other* users that might write to the
> page now need to do the cow for somebody else. So it basically
> requires a per-page count (possibly just flag) of "this has a copy
> mapping", along with everybody who might write to it that currently
> just get a ref to the page to check it, and do the rmap thing etc.
Yes, that's the same thing I identified as the unsolved hard part.
I'm going to need to go away and study dark MM lore for a while.
I agree the implementation may run into trouble, but "now we're just
haggling over the price". That's a big difference from the *idea*
being stupid because no possible implementation is practical.
The nice thing is that I don't care very much how expensive the COW is.
It's Not Supposed To Happen unless there's a legitimate race condition
bug or an illegitimate race condition exploit. It just has to be less of
a DoS attack than MAP_DENYWRITE.
Thank you very much for your insights into the implementation
practicalities. I'll direct more detailed discussions to people
like Rik, Mel and Kirill.
* Re: A use case for MAP_COPY
2017-01-05 21:10 ` George Spelvin
@ 2017-01-05 22:14 ` Kirill A. Shutemov
2017-01-05 22:49 ` Linus Torvalds
1 sibling, 0 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2017-01-05 22:14 UTC (permalink / raw)
To: George Spelvin; +Cc: torvalds, akpm, kirill.shutemov, linux-mm, mgorman, riel
On Thu, Jan 05, 2017 at 04:10:56PM -0500, George Spelvin wrote:
> It just has to be less of a DoS attack than MAP_DENYWRITE.
It's easy to turn MAP_COPY into a DoS:
- in an endless loop, mmap(MAP_COPY|MAP_FIXED) a victim file 1000 times (at
distinct addresses) into your address space;
- any attempt to write to the file then has to go through all the
mappings and put a new page in every one;
- by the time you're done with all 1000 VMAs, the attacker has created a
new bunch for you.
There's no way to guarantee it would ever complete (nasty hacks into
scheduler don't count).
--
Kirill A. Shutemov
* Re: A use case for MAP_COPY
2017-01-05 21:10 ` George Spelvin
2017-01-05 22:14 ` Kirill A. Shutemov
@ 2017-01-05 22:49 ` Linus Torvalds
2017-01-06 1:08 ` George Spelvin
1 sibling, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2017-01-05 22:49 UTC (permalink / raw)
To: George Spelvin
Cc: Andrew Morton, Kirill A. Shutemov, linux-mm, Mel Gorman, Rik van Riel
On Thu, Jan 5, 2017 at 1:10 PM, George Spelvin
<linux@sciencehorizons.net> wrote:
>
>> Not going to happen.
>
> Really? Because the rest of your response is a lot more encouraging.
The thing is, I don't think you can do it with a reasonable patch. It
just gets too nasty.
For example, what happens when there is low memory? What you would
*want* to happen is to just forget the page and read it back in.
That's how MAP_PRIVATE works. But that won't actually work for
MAP_COPY. You'd need to page the thing out, as if you had written to
it (even though you didn't). Not because you want to, but because your
versioning scheme depends on it.
So how are you going to solve that versioning problem wrt memory
pressure? The whole point of MAP_COPY is to avoid a memory copy, but
if you now end up having to do IO, and having to have a swap device
for it, it's completely unacceptable. See?
How are you going to avoid the issues with growing 'struct page'?
So the fact is, it's a horrible idea. I don't think you understand how
horrible it is. The only way you'll understand is if you try to write
the patch.
"Siperia opettaa".
So you can try to prove me wrong by sending a patch. I doubt you will.
Linus
* Re: A use case for MAP_COPY
2017-01-05 22:49 ` Linus Torvalds
@ 2017-01-06 1:08 ` George Spelvin
0 siblings, 0 replies; 6+ messages in thread
From: George Spelvin @ 2017-01-06 1:08 UTC (permalink / raw)
To: linux, torvalds; +Cc: akpm, kirill.shutemov, linux-mm, mgorman, riel
> For example, what happens when there is low memory? What you would
> *want* to happen is to just forget the page and read it back in.
> That's how MAP_PRIVATE works. But that won't actually work for
> MAP_COPY. You'd need to page the thing out, as if you had written to
> it (even though you didn't). Not because you want to, but because your
> versioning scheme depends on it.
Yes, I explained that in the first message. For memory overcommit
bean-counting purposes, it counts as a copy. When there's a request to
shrink the page, the process looks like this:
- Page dirty? Schedule write.
- Page clean, but MAP_COPY? Drop file mappings, leave dirty anonymous
page behind. Optionally (but recommended) add_to_swap() and
schedule swap-out.
- (From this point, it's a generic anonymous page.)
The net result is no worse than if you'd made a private copy in the
first place.
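
The decision above can be written down as a toy function, in Python. Nothing here corresponds to real kernel code. Action and reclaim_action are invented names, and the final DROP case is the ordinary behaviour for a clean file page that nobody holds under MAP_COPY.

```python
from enum import Enum


class Action(Enum):
    WRITE_BACK = "schedule write to the file"
    DETACH_TO_ANON = "drop file mappings, keep as dirty anonymous page"
    DROP = "forget the page, re-read it from the file later"
    SWAP_OUT = "generic anonymous page, goes to swap"


def reclaim_action(file_backed, dirty, has_map_copy):
    """Pick what reclaim does with one page under the proposed scheme."""
    if not file_backed:
        return Action.SWAP_OUT          # already a generic anonymous page
    if dirty:
        return Action.WRITE_BACK
    if has_map_copy:
        return Action.DETACH_TO_ANON    # the snapshot survives as anon memory
    return Action.DROP
```

A page that went through DETACH_TO_ANON comes back as file_backed=False on the next pass, which is the "generic anonymous page" step in the list.
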
(In *really* extreme corner cases, point the oom-killer at whoever asked
for MAP_COPY and cannibalize them so the others may live.)
The basic performance goals are:
- If the COW never happens: No slower than, and less memory than,
making an eager copy up front.
- If the COW happens: Not more than 10x slower than, and no more memory
than, making an eager copy up front.
The net result is that if the chance of a COW is less than 10%, it's
worth considering. If the chance of a COW is non-trivial, just do
an eager copy.
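
The 10% figure follows from a back-of-the-envelope expected cost, sketched here in Python. The function name and the defaults are assumptions for illustration: costs are in units of one eager copy, the no-COW path is taken to cost roughly nothing, and a COW is charged the full 10x bound.

```python
def expected_cost(p_cow, eager=1.0, lazy_no_cow=0.0, cow_penalty=10.0):
    """Expected cost of MAP_COPY when a COW happens with probability p_cow."""
    return (1.0 - p_cow) * lazy_no_cow + p_cow * cow_penalty * eager


def worth_considering(p_cow, eager=1.0):
    """True when the lazy scheme is expected to beat an eager copy."""
    return expected_cost(p_cow, eager=eager) < eager
```

With these defaults the break-even point is p_cow = 0.1. A no-COW path that is not free moves the break-even point lower.
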
> The whole point of MAP_COPY is to avoid a memory copy, but
> if you now end up having to do IO, and having to have a swap device
> for it, it's completely unacceptable. See?
No, I don't see. I thought I figured that out before posting and
explained it already. You can do the required virtual copy with no
actual RAM copies; you just have to swap the same page out twice.
The easy way to implement it serializes the two writes, which isn't
ideal, but isn't a disaster, either.
> How are you going to avoid the issues with growing 'struct page'?
At the moment, no idea. Compared to your profound grokking of the mm,
I'm like one of those mechanics saying "lookit all them WIRES in there!"
I certainly understand that growing struct page is a non-starter.
> So the fact is, it's a horrible idea. I don't think you understand how
> horrible it is. The only way you'll understand is if you try to write
> the patch.
Agreed, I definitely don't understand. For me, mm/ is a blank spot on
the map marked "Hic sunt dracones." While that's better than "Lasciate
ogne speranza, voi ch'intrate", it's still very intimidating. The part
I'm most frightened of is lock ordering. That's a maze of twisty little
passages.
And the cgroup accounting is likely to be unpleasant in the extreme.
This is definitely a long-term goal. I have to go and finish the
software that made me wish for this feature first. And then a lot
of other to-do items. But I'll start studying.
> "Siperia opettaa".
Very appropriate aphorism!