Opened 11 years ago
Last modified 11 years ago
#2798 reopened defect
raster2pgsql spins CPU on statement timeout
Reported by: | strk | Owned by: | Bborie Park |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS GDAL |
Component: | raster | Version: | master |
Keywords: | Cc: |
Description
After waiting for an hour that a raster import completed, with CPU coolers screaming as hell, I finally went looking in pg_stat_activity and found that the transaction was aborted due to a timeout:
state | idle in transaction (aborted) query | COPY "timeseq_change00_12" ("rast") FROM stdin;
This occurrence evidently did not stop raster2pgsql which took away my CPU and tried to kill it.
This is with current PostGIS trunk and GDAL 1.11.0
The test dataset is "Indonesia time-sequential primary forest extent and loss" http://glad.geog.umd.edu/indonesia/data2014/index.html
The import command:
raster2pgsql -C -I -s 4326 -Y -t 256x256 \ timeseq_change00_12.tif timeseq_change00_12
Some top(1) extracts:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 13229 strk 20 0 222m 47m 2068 R 99.2 1.2 58:00.08 raster2pgsql 13230 strk 20 0 29160 1192 924 S 0.7 0.0 0:22.77 psql 13231 strk 20 0 181m 10m 6340 S 0.7 0.3 0:32.82 postgres
Change History (18)
comment:1 by , 11 years ago
comment:2 by , 11 years ago
NOTE: statement_timeout for the importing user was 5min, so I've been cooking the CPU for 45 minutes after the timeout happened :'(
comment:3 by , 11 years ago
raster2pgsql doesn't do anything to postgresql, it just generates SQL. I'd suggest outputting to a file to see if raster2pgsql is at fault.
comment:4 by , 11 years ago
I will but if that's the case I guess the process would have received a SIGPIPE or something like that, or are you suggesting it really takes over an hour to import that file ? Did you try the import your side ?
comment:5 by , 11 years ago
Priority: | critical → medium |
---|
Confirmed the intensive CPU has nothing to do with postgresql or pipes. It's running at ~100% CPU since 10 minutes now, with output directed to /dev/null
I guess this ticket can be closed as Invalid, but I'd still interested in knowing more about where all that time goes...
Attaching the process via gdb and interrupting arbitrarely shows code in printf:
(gdb) bt #0 0x00007f73800b5f90 in _IO_vfprintf_internal (s=s@entry=0x7fff0fa387a0, format=<optimized out>, format@entry=0x419e78 "%s%s%s", ap=ap@entry=0x7fff0fa388c8) at vfprintf.c:1655 #1 0x00007f7380179544 in ___vsprintf_chk (s=0x3aa3af0 "\230\247B\200s\177", flags=1, slen=18446744073709551615, format=0x419e78 "%s%s%s", args=args@entry=0x7fff0fa388c8) at vsprintf_chk.c:84 #2 0x00007f738017948d in ___sprintf_chk (s=s@entry=0x3aa3af0 "\230\247B\200s\177", flags=flags@entry=1, slen=slen@entry=18446744073709551615, format=format@entry=0x419e78 "%s%s%s") at sprintf_chk.c:32 #3 0x0000000000406c94 in sprintf (__fmt=0x419e78 "%s%s%s", __s=0x3aa3af0 "\230\247B\200s\177") at /usr/include/x86_64-linux-gnu/bits/stdio2.h:33 #4 insert_records (schema=<optimized out>, table=0x21b8370 "\"timeseq_change00_12\"", column=0x21b72e0 "\"rast\"", filename=0x0, file_column_name=0x21c04f0 "\"filename\"", copy_statements=<optimized out>, tileset=tileset@entry=0x7fff0fa38b70, buffer=buffer@entry=0x2197ca0) at raster2pgsql.c:860
comment:6 by , 11 years ago
Resolution: | → invalid |
---|---|
Status: | new → closed |
later break was found in GDAL, so I guess this ticket does not even belong here, but rather in the GDAL driver...
comment:7 by , 11 years ago
Milestone: | → PostGIS GDAL |
---|---|
Resolution: | invalid |
Status: | closed → reopened |
I've restarted the import after lifting the statement_timeout, and here's the final timing:
- 5787179.252 ms for COPY from stdin (1h36m)
- 29042.048 ms for CREATE INDEX (29s)
- 466143.600 ms for AddRasterConstraints (7m46s)
I undertand this may possibly be a GDAL driver issue, but I tought I'd reopen this ticket until sorted out. Will file it on GDAL too.
comment:9 by , 11 years ago
Sandro, there's no GDAL symbol in the above stacktrace. You should break several times to have a representative sample, and it is expected that sometimes it will show GDAL symbols. The performance might depend on if the original TIFF is tiled or not, and how raster2pgsql uses the GDAL API. But one hour is a lot. Did you just try "gdal_translate timeseq_change00_12.tif out.tif -co TILED=YES" ?
comment:10 by , 11 years ago
gdal_translate seems to be lighter on the CPU:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 770 strk 20 0 207m 63m 1004 D 41.8 1.6 1:44.72 gdal_translate
Also, it took 6m15s to complete the conversion to TIF:
[strk@cdb:~/Downloads/change] time gdal_translate timeseq_change00_12.tif out.tif -co TILED=YES Input file size is 172000, 68000 0...10...20...30...40...50...60...70...80...90...100 - done. real 6m15.583s user 0m48.100s sys 1m17.855s
comment:11 by , 11 years ago
gprof reported times:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls Ts/call Ts/call name 93.66 0.59 0.59 rt_raster_to_hexwkb 1.59 0.60 0.01 rt_band_get_data
comment:13 by , 11 years ago
I think the source of the slowdown is the chunking of the source raster to 256x256 tiles. For each tile, rt_raster_to_hexwkb is called.
gdal_translate is faster because it isn't converting data to hex. Everything stays binary.
Based upon that gprof, we'd have to see what we can do about improving hex conversion performance.
comment:14 by , 11 years ago
Better profile:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 50.43 2.34 2.34 393797216 0.00 0.00 deparse_hex 38.15 4.11 1.77 2048 0.00 0.00 rt_raster_to_hexwkb 9.27 4.54 0.43 parse_hex
comment:15 by , 11 years ago
I'm working on inlining the hex conversion, but I think there's also a problem of more allocations/deallocations than needed. Also, why don't we just write the output to stdout as it comes rather than growing that STRINGBUFFER ?
comment:16 by , 11 years ago
I think the original idea behind the STRINGBUFFER was to minimize the number of flushes, especially noticeable when pointing to a file.
comment:17 by , 11 years ago
With inlined hex conversion, and line-by-line profiling:
Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ns/call ns/call name 53.88 0.63 0.63 rt_raster_to_hexwkb (rt_wkb.c:697 @ 41ce8b) 14.66 0.80 0.17 rt_raster_to_hexwkb (rt_wkb.c:695 @ 41ce66) 11.21 0.93 0.13 rt_raster_to_hexwkb (rt_wkb.c:695 @ 41ce9d) 5.60 0.99 0.07 rt_raster_to_hexwkb (rt_wkb.c:697 @ 41ce7c)
The lines 695 and 697 are these:
695- while (wkbsize--) { 696- uint8_t v = *iptr++; 697: *optr++ = hexchar[v>>4]; 698: *optr++ = hexchar[v & 0x0F];
It's interesting that v>>4 takes much more time than v & 0x0F. I wonder if it would make sense to make a 255 items table to speed it up further.
As for flushes, I think you can control stdout buffering strategy.
comment:18 by , 11 years ago
so, this command:
time ./raster2pgsql -t '256x256' ~/Downloads/HYP_LR/HYP_LR.tif unused > /dev/null Processing 1/1: /home/strk/Downloads/HYP_LR/HYP_LR.tif
Without the inlining of hex conversion:
real 0m8.120s user 0m7.012s sys 0m0.795s
With the inlining:
real 0m7.703s user 0m6.651s sys 0m0.868s
It's not such a big improvement.
Excluding the hex conversion completely brings down the time to:
real 0m2.871s user 0m2.076s sys 0m0.642s
Forgot to mention, the potgresql log extract: