* Re: [PATCH] block devices: validate block device capacity
[not found] ` <1391125027.2181.114.camel@dabdike.int.hansenpartnership.com>
@ 2014-01-31 0:20 ` Mikulas Patocka
2014-01-31 1:43 ` James Bottomley
0 siblings, 1 reply; 7+ messages in thread
From: Mikulas Patocka @ 2014-01-31 0:20 UTC (permalink / raw)
To: James Bottomley
Cc: Jens Axboe, Alasdair G. Kergon, Mike Snitzer, dm-devel,
David S. Miller, linux-ide, linux-scsi, linux-kernel, Neil Brown,
linux-raid, linux-mm
On Thu, 30 Jan 2014, James Bottomley wrote:
> On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> >
> > On Thu, 30 Jan 2014, James Bottomley wrote:
> >
> > > Why is this? the whole reason for CONFIG_LBDAF is supposed to be to
> > > allow 64 bit offsets for block devices on 32 bit. It sounds like
> > > there's somewhere not using sector_t ... or using it wrongly which needs
> > > fixing.
> >
> > The page cache uses unsigned long as a page index. Therefore, if unsigned
> > long is 32-bit, the block device may have at most 2^32-1 pages.
>
> Um, that's the index into the mapping, not the device; a device can have
> multiple mappings and each mapping has a radix tree of pages. For most
> filesystems a mapping is equivalent to a file, so we can have large
> filesystems, but they can't have files over actually 4GB on 32 bits
> otherwise mmap fails.
A device may be accessed direcly (by opening /dev/sdX) and it creates a
mapping too - thus, the size of a mapping limits the size of a block
device.
The main problem is that pgoff_t has 4 bytes - chaning it to 8 bytes may
fix it - but there may be some hidden places where pgoff is converted to
unsigned long - who knows, if they exist or not?
> Are we running into a problems with struct address_space where we've
> assumed the inode belongs to the file and lvm is doing something where
> it's the whole device?
lvm creates a 64TiB device, udev runs blkid on that device and blkid opens
the device and gets stuck because of unsigned long overflow.
> > > > On 32-bit architectures, we must limit block device size to
> > > > PAGE_SIZE*(2^32-1).
> > >
> > > So you're saying CONFIG_LBDAF can never work, why?
> > >
> > > James
> >
> > CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
> > without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is
> > 16TiB (4096*2^32).
>
> I don't think the people who did the large block device work expected to
> gain only 3 bits for all their pain.
>
> James
One could change it to have three choices:
2TiB limit - 32-bit sector_t and 32-bit pgoff_t
16TiB limit - 64-bit sector_t and 32-bit pgoff_t
32PiB limit - 64-bit sector_t and 64-bit pgoff_t
Though, we need to know if the people who designed memory management agree
with changing pgoff_t to 64 bits.
Mikulas
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] block devices: validate block device capacity
2014-01-31 0:20 ` [PATCH] block devices: validate block device capacity Mikulas Patocka
@ 2014-01-31 1:43 ` James Bottomley
2014-01-31 2:43 ` Mikulas Patocka
0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2014-01-31 1:43 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Alasdair G. Kergon, Mike Snitzer, dm-devel,
David S. Miller, linux-ide, linux-scsi, linux-kernel, Neil Brown,
linux-raid, linux-mm
On Thu, 2014-01-30 at 19:20 -0500, Mikulas Patocka wrote:
>
> On Thu, 30 Jan 2014, James Bottomley wrote:
>
> > On Thu, 2014-01-30 at 18:10 -0500, Mikulas Patocka wrote:
> > >
> > > On Thu, 30 Jan 2014, James Bottomley wrote:
> > >
> > > > Why is this? the whole reason for CONFIG_LBDAF is supposed to be to
> > > > allow 64 bit offsets for block devices on 32 bit. It sounds like
> > > > there's somewhere not using sector_t ... or using it wrongly which needs
> > > > fixing.
> > >
> > > The page cache uses unsigned long as a page index. Therefore, if unsigned
> > > long is 32-bit, the block device may have at most 2^32-1 pages.
> >
> > Um, that's the index into the mapping, not the device; a device can have
> > multiple mappings and each mapping has a radix tree of pages. For most
> > filesystems a mapping is equivalent to a file, so we can have large
> > filesystems, but they can't have files over actually 4GB on 32 bits
> > otherwise mmap fails.
>
> A device may be accessed direcly (by opening /dev/sdX) and it creates a
> mapping too - thus, the size of a mapping limits the size of a block
> device.
Right, that's what I suspected below. We can't damage large block
support on filesystems just because of this corner case.
> The main problem is that pgoff_t has 4 bytes - chaning it to 8 bytes may
> fix it - but there may be some hidden places where pgoff is converted to
> unsigned long - who knows, if they exist or not?
I don't think we want to do that ... it will make struct page fatter and
have knock on impacts in the radix tree code. To fix this, we need to
make the corner case (i.e. opening large block devices without a
filesystem) bear the pain. It sort of looks like we want to do a linear
array of mappings of 64TB for the device so the page cache calculations
don't overflow.
> > Are we running into a problems with struct address_space where we've
> > assumed the inode belongs to the file and lvm is doing something where
> > it's the whole device?
>
> lvm creates a 64TiB device, udev runs blkid on that device and blkid opens
> the device and gets stuck because of unsigned long overflow.
well a simple open won't cause this ... it must be trying to read the
end of the device for some reason. But anyway, the way to fix this is
to fix the large block open as a corner case.
> > > > > On 32-bit architectures, we must limit block device size to
> > > > > PAGE_SIZE*(2^32-1).
> > > >
> > > > So you're saying CONFIG_LBDAF can never work, why?
> > > >
> > > > James
> > >
> > > CONFIG_LBDAF works, but it doesn't allow unlimited capacity: on x86,
> > > without CONFIG_LBDAF, the limit is 2TiB. With CONFIG_LBDAF, the limit is
> > > 16TiB (4096*2^32).
> >
> > I don't think the people who did the large block device work expected to
> > gain only 3 bits for all their pain.
> >
> > James
>
> One could change it to have three choices:
> 2TiB limit - 32-bit sector_t and 32-bit pgoff_t
> 16TiB limit - 64-bit sector_t and 32-bit pgoff_t
> 32PiB limit - 64-bit sector_t and 64-bit pgoff_t
>
> Though, we need to know if the people who designed memory management agree
> with changing pgoff_t to 64 bits.
I don't think we can change the size of pgoff_t ... because it won't
just be that, it will be other problems like the radix tree.
However, you also have to bear in mind that truncating large block
device support to 64TB on 32 bits is a technical ABI break. Hopefully
it is only technical because I don't know of any current consumer block
device that is 64TB yet, but anyone who'd created a filesystem >64TB
would find it no-longer mounted on 32 bits.
James
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] block devices: validate block device capacity
2014-01-31 1:43 ` James Bottomley
@ 2014-01-31 2:43 ` Mikulas Patocka
2014-01-31 5:45 ` James Bottomley
0 siblings, 1 reply; 7+ messages in thread
From: Mikulas Patocka @ 2014-01-31 2:43 UTC (permalink / raw)
To: James Bottomley
Cc: Jens Axboe, Alasdair G. Kergon, Mike Snitzer, dm-devel,
David S. Miller, linux-ide, linux-scsi, linux-kernel, Neil Brown,
linux-raid, linux-mm
On Thu, 30 Jan 2014, James Bottomley wrote:
> > A device may be accessed direcly (by opening /dev/sdX) and it creates a
> > mapping too - thus, the size of a mapping limits the size of a block
> > device.
>
> Right, that's what I suspected below. We can't damage large block
> support on filesystems just because of this corner case.
Devices larger than 16TiB never worked on 32-bit kernel, so this patch
isn't damaging anything.
Note that if you attach a 16TiB block device, don't open it and mount it,
it still won't work, because the buffer cache uses the page cache (see the
function __find_get_block_slow and the variable "pgoff_t index" - that
variable would overflow if the filesystem accessed a buffer beyond 16TiB).
> > The main problem is that pgoff_t has 4 bytes - chaning it to 8 bytes may
> > fix it - but there may be some hidden places where pgoff is converted to
> > unsigned long - who knows, if they exist or not?
>
> I don't think we want to do that ... it will make struct page fatter and
> have knock on impacts in the radix tree code. To fix this, we need to
> make the corner case (i.e. opening large block devices without a
> filesystem) bear the pain. It sort of looks like we want to do a linear
> array of mappings of 64TB for the device so the page cache calculations
> don't overflow.
The code that reads and writes data to block devices and files is shared -
the functions in mm/filemap.c work for both files and block devices.
So, if you want 64-bit page offsets, you need to increase pgoff_t size,
and that will increase the limit for both files and block devices.
You shouldn't have separate functions for managing pages on files and
separate functions for managing pages on block devices - that would
increase code size and cause maintenance problems.
> > Though, we need to know if the people who designed memory management agree
> > with changing pgoff_t to 64 bits.
>
> I don't think we can change the size of pgoff_t ... because it won't
> just be that, it will be other problems like the radix tree.
If we can't change it, then we must stay with the current 16TiB limit.
There's no other way.
> However, you also have to bear in mind that truncating large block
> device support to 64TB on 32 bits is a technical ABI break. Hopefully
> it is only technical because I don't know of any current consumer block
> device that is 64TB yet, but anyone who'd created a filesystem >64TB
> would find it no-longer mounted on 32 bits.
> James
It is not ABI break, because block devices larger than 16TiB never worked
on 32-bit architectures. So it's better to refuse them outright, than to
cause subtle lockups or data corruption.
Mikulas
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] block devices: validate block device capacity
2014-01-31 2:43 ` Mikulas Patocka
@ 2014-01-31 5:45 ` James Bottomley
2014-01-31 8:20 ` Mikulas Patocka
0 siblings, 1 reply; 7+ messages in thread
From: James Bottomley @ 2014-01-31 5:45 UTC (permalink / raw)
To: Mikulas Patocka
Cc: Jens Axboe, Alasdair G. Kergon, Mike Snitzer, dm-devel,
David S. Miller, linux-ide, linux-scsi, linux-kernel, Neil Brown,
linux-raid, linux-mm
On Thu, 2014-01-30 at 21:43 -0500, Mikulas Patocka wrote:
>
> On Thu, 30 Jan 2014, James Bottomley wrote:
>
> > > A device may be accessed direcly (by opening /dev/sdX) and it creates a
> > > mapping too - thus, the size of a mapping limits the size of a block
> > > device.
> >
> > Right, that's what I suspected below. We can't damage large block
> > support on filesystems just because of this corner case.
>
> Devices larger than 16TiB never worked on 32-bit kernel, so this patch
> isn't damaging anything.
expectations: 32 bit with CONFIG_LBDAF is supposed to be able to do
almost everything 64 bits can
> Note that if you attach a 16TiB block device, don't open it and mount it,
> it still won't work, because the buffer cache uses the page cache (see the
> function __find_get_block_slow and the variable "pgoff_t index" - that
> variable would overflow if the filesystem accessed a buffer beyond 16TiB).
That depends on the layout of the fs metadata.
> > > The main problem is that pgoff_t has 4 bytes - chaning it to 8 bytes may
> > > fix it - but there may be some hidden places where pgoff is converted to
> > > unsigned long - who knows, if they exist or not?
> >
> > I don't think we want to do that ... it will make struct page fatter and
> > have knock on impacts in the radix tree code. To fix this, we need to
> > make the corner case (i.e. opening large block devices without a
> > filesystem) bear the pain. It sort of looks like we want to do a linear
> > array of mappings of 64TB for the device so the page cache calculations
> > don't overflow.
>
> The code that reads and writes data to block devices and files is shared -
> the functions in mm/filemap.c work for both files and block devices.
Yes.
> So, if you want 64-bit page offsets, you need to increase pgoff_t size,
> and that will increase the limit for both files and block devices.
No. The point is the page cache mapping of the device uses a
manufactured inode saved in the backing device. It looks fixable in the
buffer code before the page cache gets involved.
> You shouldn't have separate functions for managing pages on files and
> separate functions for managing pages on block devices - that would
> increase code size and cause maintenance problems.
It wouldn't it would add structure to the buffer cache for large
devices.
> > > Though, we need to know if the people who designed memory management agree
> > > with changing pgoff_t to 64 bits.
> >
> > I don't think we can change the size of pgoff_t ... because it won't
> > just be that, it will be other problems like the radix tree.
>
> If we can't change it, then we must stay with the current 16TiB limit.
> There's no other way.
>
> > However, you also have to bear in mind that truncating large block
> > device support to 64TB on 32 bits is a technical ABI break. Hopefully
> > it is only technical because I don't know of any current consumer block
> > device that is 64TB yet, but anyone who'd created a filesystem >64TB
> > would find it no-longer mounted on 32 bits.
> > James
>
> It is not ABI break, because block devices larger than 16TiB never worked
> on 32-bit architectures. So it's better to refuse them outright, than to
> cause subtle lockups or data corruption.
An ABI is a contract between the userspace and the kernel. Saying we
can remove a clause in the contract because no-one ever exercised it and
not call it changing the contract is sophistry. The correct thing to do
would be to call it a bug and fix it.
In a couple of short years we'll be over 16TB for hard drives. I don't
really want to be the one explaining to the personal storage people that
the only way to install a 16+TB drive in their arm (or quark) based
Linux systems is a processor upgrade.
I suppose there are a couple of possibilities: pgoff_t + radix tree
expansion or double radix tree in the buffer code. This should probably
be taken to fsdevel where they might have better ideas.
James
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] block devices: validate block device capacity
2014-01-31 5:45 ` James Bottomley
@ 2014-01-31 8:20 ` Mikulas Patocka
2014-02-03 8:15 ` Christoph Hellwig
0 siblings, 1 reply; 7+ messages in thread
From: Mikulas Patocka @ 2014-01-31 8:20 UTC (permalink / raw)
To: James Bottomley
Cc: Jens Axboe, Alasdair G. Kergon, Mike Snitzer, dm-devel,
David S. Miller, linux-ide, linux-scsi, linux-kernel, Neil Brown,
linux-raid, linux-mm
On Thu, 30 Jan 2014, James Bottomley wrote:
> > So, if you want 64-bit page offsets, you need to increase pgoff_t size,
> > and that will increase the limit for both files and block devices.
>
> No. The point is the page cache mapping of the device uses a
> manufactured inode saved in the backing device. It looks fixable in the
> buffer code before the page cache gets involved.
So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
send a patch that does it.
Until you make it, you should apply the patch that I sent, that prevents
kernel lockups or data corruption when the user uses 16TiB device on
32-bit kernel.
Mikulas
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] block devices: validate block device capacity
2014-01-31 8:20 ` Mikulas Patocka
@ 2014-02-03 8:15 ` Christoph Hellwig
2014-02-03 20:22 ` Mikulas Patocka
0 siblings, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2014-02-03 8:15 UTC (permalink / raw)
To: Mikulas Patocka
Cc: James Bottomley, Jens Axboe, Alasdair G. Kergon, Mike Snitzer,
dm-devel, David S. Miller, linux-ide, linux-scsi, linux-kernel,
Neil Brown, linux-raid, linux-mm
On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
> send a patch that does it.
>
> Until you make it, you should apply the patch that I sent, that prevents
> kernel lockups or data corruption when the user uses 16TiB device on
> 32-bit kernel.
Exactly. I had actually looked into support for > 16TiB devices for
a NAS use case a while ago, but when explaining the effort involves
the idea was dropped quickly. The Linux block device is too deeply
tied to the pagecache to make it easily feasible.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] block devices: validate block device capacity
2014-02-03 8:15 ` Christoph Hellwig
@ 2014-02-03 20:22 ` Mikulas Patocka
0 siblings, 0 replies; 7+ messages in thread
From: Mikulas Patocka @ 2014-02-03 20:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: James Bottomley, Jens Axboe, Alasdair G. Kergon, Mike Snitzer,
dm-devel, David S. Miller, linux-ide, linux-scsi, linux-kernel,
Neil Brown, linux-raid, linux-mm
On Mon, 3 Feb 2014, Christoph Hellwig wrote:
> On Fri, Jan 31, 2014 at 03:20:17AM -0500, Mikulas Patocka wrote:
> > So if you think you can support 16TiB devices and leave pgoff_t 32-bit,
> > send a patch that does it.
> >
> > Until you make it, you should apply the patch that I sent, that prevents
> > kernel lockups or data corruption when the user uses 16TiB device on
> > 32-bit kernel.
>
> Exactly. I had actually looked into support for > 16TiB devices for
> a NAS use case a while ago, but when explaining the effort involves
> the idea was dropped quickly. The Linux block device is too deeply
> tied to the pagecache to make it easily feasible.
The memory management routines use pgoff_t, so we could define pgoff_t to
be 64-bit type. But there is lib/radix_tree.c that uses unsigned long as
an index into the radix tree - and pgoff_t is cast to unsigned long when
calling the radix_tree routines - so we'd need to change lib/radix_tree to
use pgoff_t.
Then, there may be other places where pgoff_t is cast to unsigned long and
they are not trivial to find (one could enable some extra compiler
warnings about truncating values when casting them, but I suppose, this
would trigger a lot of false positives). This needs some deep review by
people who designed the memory management code.
Mikulas
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2014-02-03 20:22 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <alpine.LRH.2.02.1401301531040.29912@file01.intranet.prod.int.rdu2.redhat.com>
[not found] ` <1391122163.2181.103.camel@dabdike.int.hansenpartnership.com>
[not found] ` <alpine.LRH.2.02.1401301805590.19506@file01.intranet.prod.int.rdu2.redhat.com>
[not found] ` <1391125027.2181.114.camel@dabdike.int.hansenpartnership.com>
2014-01-31 0:20 ` [PATCH] block devices: validate block device capacity Mikulas Patocka
2014-01-31 1:43 ` James Bottomley
2014-01-31 2:43 ` Mikulas Patocka
2014-01-31 5:45 ` James Bottomley
2014-01-31 8:20 ` Mikulas Patocka
2014-02-03 8:15 ` Christoph Hellwig
2014-02-03 20:22 ` Mikulas Patocka
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox