Opened 15 years ago
Closed 13 years ago
#393 closed defect (fixed)
shp2pgsql returns "fseek(-xxx) failed on DBF file." for large (>2GB) DBF files
Reported by: | maximeguillaud | Owned by: | pramsey |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.0.0 |
Component: | utils/loader-dumper | Version: | 1.5.X |
Keywords: | shp2pgsql fseek failed large file | Cc: | arencambre |
Description
Running shp2pgsql on large files fails when the .DBF is over 2^31 bytes (2 GiB). It throws multiple messages such as
"fseek(-2124469057) failed on DBF file."
(with the quoted offset changing).
An example of such large files is the europe_highway.shp in the archive at http://downloads.cloudmade.com/europe/europe.shapefiles.zip. These files are OpenStreetMap data.
The attached patch fixes the problem for me on a Linux/amd64 platform.
Attachments (3)
Change History (24)
by , 15 years ago
Attachment: | int_to_long.diff added |
---|
comment:1 by , 15 years ago
comment:2 by , 15 years ago
Milestone: | PostGIS 1.4.2 → PostGIS 1.4.3 |
---|
comment:3 by , 14 years ago
I've applied the patch as a temporary measure into the 1.5 branch at r5787. This still needs a clean resolution for trunk and going forward.
comment:5 by , 14 years ago
Cc: | added |
---|---|
Milestone: | PostGIS 1.4.3 |
Version: | → 1.5.X |
Also getting this with 1.5.2 on Windows 7 32 bit. Windows Explorer says my DBF is 2,147,484,046 bytes. Dividing by 1024 three times comes out to 2.0000003GB. Hmmm...
comment:7 by , 14 years ago
Component: | postgis → loader/dumper |
---|
comment:8 by , 13 years ago
Milestone: | PostGIS 1.5.3 → PostGIS 2.0.0 |
---|
comment:9 by , 13 years ago
http://bugzilla.maptools.org/show_bug.cgi?id=1463 just got marked as resolved, and it may be related to this.
comment:10 by , 13 years ago
Regarding previous patches, changing "int" to "long" has no effect on 32-bit Linux at least, since both are 4 bytes there (on 64-bit Linux, int stays 4 bytes while long is 8 bytes, which is why the first patch helps on amd64). One can write a simple C program to printf("sizeof(int): %zu, sizeof(long): %zu\n", sizeof(int), sizeof(long)) to verify.
My attached 6-line large_dbf.v1.patch fixes my problems writing > 2GB DBF files with shp2pgsql on 32-bit Linux.
- Add -D_FILE_OFFSET_BITS=64 to Makefile's CFLAGS, which makes off_t a signed 8-byte integer (equivalent to the long long type).
- In shapefil.h, change typedef of SAOffset from unsigned long (which is 4-byte on 32-bit systems) to off_t.
- In safileio.c, change fseek call to fseeko, and cast offset to off_t rather than long. fseek always takes a long, and a long can't be made 8 bytes on a 32-bit system.
by , 13 years ago
Attachment: | large_dbf.v1.patch added |
---|
comment:11 by , 13 years ago
Please keep in mind that _FILE_OFFSET_BITS is nonportable; it isn't used on BSDs, where off_t is simply always int64_t. But using off_t and fseeko instead of long seems entirely sensible. So probably the -D_FILE_OFFSET_BITS=64 needs to be wrapped in a configure test to be added only on operating systems where it makes sense.
comment:12 by , 13 years ago
Given that we recently re-synced with upstream, I don't really want us to be maintaining a fork once again :( How does upstream shapelib handle this issue? I'd much rather you got the patches tested/accepted there and then we can simply re-sync.
comment:13 by , 13 years ago
I filed a bug in ShapeLib's bug tracker: http://bugzilla.maptools.org/show_bug.cgi?id=2359
w.r.t. _FILE_OFFSET_BITS nonportability, to be less intrusive, instead of specifying -D_FILE_OFFSET_BITS=64 in the Makefile's CFLAGS, we could just #define _FILE_OFFSET_BITS 64 at the top of pgsql2shp.c, before any #includes as described here: http://www.gnu.org/software/libc/manual/html_node/Feature-Test-Macros.html
comment:14 by , 13 years ago
Shapelib 1.3.0b has this
#ifndef SAOffset
typedef unsigned long SAOffset;
#endif
which seems reasonable enough to me. An unsigned integer gets us to 4GB, so we max out 32 bit systems. The definition of off_t at least on OSX is of a signed value, so that means we could support large files on 64bit but not on 32bit. Not good enough? As Mark says, our best bet is just to track shapelib closely.
comment:15 by , 13 years ago
Poking around shapelib, it looks like we still have some patches it does not, around date handling in particular.
comment:16 by , 13 years ago
I've updated to the latest shapelib in trunk at r8919.
I wonder, does this bug refer to trunk anymore? We should go up quite large using "unsigned long" as our offset, and work on both 32 and 64 bits. Perhaps this problem is only a 1.5 problem now. Comments?
comment:17 by , 13 years ago
OSX is not a problem, since off_t and unsigned long are both 64bit integers on 32 bit Macs:
$ uname -a && cat sizeof_types.c && gcc sizeof_types.c && ./a.out
Darwin laptop 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
#include <stdio.h>
int main(int argc, char **argv) {
    printf("sizeof(off_t): %lu, sizeof(unsigned long): %lu, sizeof(int): %lu\n", sizeof(off_t), sizeof(unsigned long), sizeof(int));
}
32-bit Linux will still have the 2GB limitation against trunk, since off_t (by default, without _FILE_OFFSET_BITS=64) and unsigned long are both 32bit integers:
$ uname -a && cat sizeof_types.c && gcc sizeof_types.c && ./a.out
Linux host 2.6.30-1-686 #1 SMP Sat Aug 15 19:11:58 UTC 2009 i686 GNU/Linux
#include <stdio.h>
int main(int argc, char **argv) {
    printf("sizeof(off_t): %lu, sizeof(unsigned long): %lu, sizeof(int): %lu\n", sizeof(off_t), sizeof(unsigned long), sizeof(int));
}
sizeof(off_t): 4, sizeof(unsigned long): 4, sizeof(int): 4
comment:18 by , 13 years ago
Last line of OSX output got truncated, here it is again:
$ uname -a && cat sizeof_types.c && gcc sizeof_types.c && ./a.out
Darwin laptop 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun 7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386
#include <stdio.h>
int main(int argc, char **argv) {
    printf("sizeof(off_t): %lu, sizeof(unsigned long): %lu, sizeof(int): %lu\n", sizeof(off_t), sizeof(unsigned long), sizeof(int));
}
sizeof(off_t): 8, sizeof(unsigned long): 8, sizeof(int): 4
comment:19 by , 13 years ago
Since SAOffset is unsigned long, won't we end up with a 4GB limit on 32bit linux? That's still 2x bigger than our limit before... should we aim yet higher?
comment:20 by , 13 years ago
Paul, in safileio.c's SADFSeek, the SAOffset value gets cast to (signed) long, making the limit 2GB on 32bit Linux.
SAOffset SADFSeek( SAFile file, SAOffset offset, int whence )
{
    return (SAOffset) fseek( (FILE *) file, (long) offset, whence );
}
by , 13 years ago
Attachment: | seeko.patch added |
---|
How about this? Wonder what platforms fseeko does not exist on...
comment:21 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
OK, hearing nothing more I've taken the original dfurhry patch and wrapped it in some autoconf testing for fseeko to hopefully guard platforms that don't have it.
I've also pulled our shapelib up to the b3 release, but it still has custom stuff for our date and boolean handling. I've submitted patches on those changes into the shapelib tracker, so maybe some day in the future we'll be able to track a clean shapelib.
In trunk at r8967.
I think that using fseeko and off_t is a better approach, hopefully one that can be used even on 32-bit operating systems, if we are careful.