Opened 13 years ago

Closed 13 years ago

#1465 closed defect (fixed)

loader/Latin1 regression failure with database SQL_ASCII encoding

Reported by: gdt Owned by: mcayland
Priority: medium Milestone: PostGIS 2.0.0
Component: utils/loader-dumper Version: master
Keywords: Cc:

Description

$ ./run_test loader/Latin1
Creating spatial db postgis_reg 
TMPDIR is /tmp/pgis_reg_20444
PATH is /home/gdt/ccache:/usr/pkg/java/sun-6/bin:/usr/amanda/bin:/usr/amanda/sbin:/home/gdt/bin:/home/gdt/bin/i386-NetBSD:/usr/y0/sbin:/usr/y0/bin:/usr/pkg/sbin:/usr/pkg/bin:/usr/X11R7/bin:/usr/local/sbin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/pilot/bin

 PostgreSQL 8.4.10 on i386--netbsdelf, compiled by GCC gcc (GCC) 4.1.3 20080704 prerelease (NetBSD nb3 20111107), 32-bit
 Postgis 2.0.0SVN - 2012-01-13 20:25:58
   GEOS: 3.3.2-CAPI-1.7.2
   PROJ: Rel. 4.7.1, 23 September 2009

Running tests

 loader/Latin1 ... failed ( wkb test: running shp2pgsql output: /tmp/pgis_reg_20444/loader.err)
 uninstall .. ok (3697)

Run tests: 2
Failed: 1
SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
BEGIN;
CREATE TABLE "loadedshp" (gid serial,
"address" varchar(32));
ALTER TABLE "loadedshp" ADD PRIMARY KEY (gid);
INSERT INTO "loadedshp" ("address") VALUES ('Tårneby in Våler I Solør kommune');
COMMIT;
SET
SET
BEGIN
psql:/tmp/pgis_reg_20444/loader.out:5: NOTICE:  CREATE TABLE will create implicit sequence "loadedshp_gid_seq" for serial column "loadedshp.gid"
CREATE TABLE
psql:/tmp/pgis_reg_20444/loader.out:6: NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "loadedshp_pkey" for table "loadedshp"
ALTER TABLE
psql:/tmp/pgis_reg_20444/loader.out:7: ERROR:  value too long for type character varying(32)

The issue might be that the encoding of the string is more than 32:

regress > wc Tårneby in Våler I Solør kommune

1 6 36

(which is 35, since I added a return).

I don't know enough about beyond-ASCII character sets to understand what's wrong, but the loader-generated SQL looks ok.

Is varchar(N) supposed to handle N bytes, or N possibly-wide characters?

Change History (12)

comment:1 by gdt, 13 years ago

Editing the loader.out and re-running, I find that with the column declared as varchar(35) the insert works, and as varchar(34) it errors. So this seems to be a postgresql issue (as compiled, i18n is hard it seems), not a postgis issue, and I'm looking into it there.

comment:2 by gdt, 13 years ago

I reinitialized postgresql with LC_ALL=en_US.ISO8859-1, and the string can be inserted, but on output it differs from expected. I'm 99% sure the output is 8859-1 (bytes match what's in Latin1.dbf). But the expected output has 35 bytes, and appears to be UTF-8 encoded.

With this change, the test passes

Index: regress/loader/Latin1.select.sql
===================================================================
--- regress/loader/Latin1.select.sql	(revision 8814)
+++ regress/loader/Latin1.select.sql	(working copy)
@@ -1 +1,2 @@
+set client_encoding to UTF8;
 SELECT * FROM loadedshp;

So there seem to be two issues:

inserting UTF8 into SQL_ASCII requires a bigger varchar to account for the expansion

Unless there's a requirement for the database to have a particular encoding, loading/dumping both have to set the encoding.

comment:3 by gdt, 13 years ago

If I nuke the pgsql database dir and recreate with LC_ALL=en_US.UTF-8, then the regression test passes without my modification. But even with the modification, creating the database without LC_ALL set (and hence SQL_ASCII encoding) causes the test to fail with value too long (because there are 35 bytes in the UTF-8 encoding of the test address, which don't fit in varchar(32)).

comment:4 by mcayland, 13 years ago

Right, so we don't use PostgreSQL to do the encoding conversion but use iconv instead. Hence we force everything to UTF-8 encoding in the output and indicate this to PostgreSQL.

Now I can't say that I have any shapefiles lying around without 7-bit ASCII names, so it wouldn't surprise me if there are errors in this area. I'll try and take a look at this, however I'm fairly short on time this week.

comment:5 by gdt, 13 years ago

To clarify, there are two separate bugs in the regression test:

  • selecting the string doesn't set the encoding, so e.g. if the native encoding is ISO8859-1, the ISO8859-1-encoded version of that address doesn't mach the UTF-8 version. This is easy to fix as above.
  • If the database is SQL_ASCII, then 35 bytes that are the UTF-8 version of a 32-character address doesn't fit in a varchar(32). The fix is either to use a shorter string, or a longer varchar, or to explicitly create the regression database with a particular encoding.

comment:6 by strk, 13 years ago

Creating the DB with a known encoding seems good to me. We really need to be more in control of the environment our testsuite runs in, as much as I'd like to create a whole cluster to enforce such control.

@gdt: the DB creation code is in regress/run_test, if you want to come up with a patch. Make sure it works from 8.4 to 9.2

comment:7 by bnordgren, 13 years ago

This is a duplicate of #1169. However, this ticket has better information so it may be better to say that #1169 is a duplicate of this.

comment:8 by gdt, 13 years ago

Summary: loader/Latin1 regression failure (NetBSD, postgresql 8.4.10)loader/Latin1 regression failure with database SQL_ASCII encoding

With the following, the loader/Latin1 test passes even when pgsql was initialized with no locale and hence is SQL_ASCII. Note that I've chosen to set the encoding to ISO8859-1 instead of UTF-8, to force showing the translation on load and query works too.

(Of course, the commented-out --locale line is not needed; that fails on my system and I'm not sure why -- it's left as something to try.)

Index: regress/run_test
===================================================================
--- regress/run_test	(revision 8924)
+++ regress/run_test	(working copy)
@@ -559,7 +559,8 @@
 		# ON_ERROR_STOP is used by psql to return non-0 on an error
 		_psql_opts="--no-psqlrc --variable ON_ERROR_STOP=true"
 
-		createdb --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
+		# --locale en_US.UTF-8
+		createdb --encoding=ISO8859-1 --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
 		createlang plpgsql "${DB}" >> ${TMPDIR}/regress_log
 
 		# Count database objects before installing anything
Index: regress/loader/Latin1.select.sql
===================================================================
--- regress/loader/Latin1.select.sql	(revision 8924)
+++ regress/loader/Latin1.select.sql	(working copy)
@@ -1 +1,2 @@
+SET CLIENT_ENCODING to UTF8;
 SELECT * FROM loadedshp;

comment:9 by gdt, 13 years ago

I believe the --locale line fails on my system because LC_COLLATE is not implemented well enough to satisfy postgresql. But just setting the encoding is enough for the test, because it's just about encoding round trips and no actual locale processing is needed.

comment:10 by pramsey, 13 years ago

Does forcing the encoding to UTF-8 work just as well? I don't want to test against ISO8859-1 when that's not a best practice, we prefer everyone to use UTF-8 encodings.

comment:11 by gdt, 13 years ago

Yes, forcing to UTF-8 works too (with the set client encoding in the select). Clean patch:

Index: regress/run_test
===================================================================
--- regress/run_test	(revision 9084)
+++ regress/run_test	(working copy)
@@ -559,7 +559,7 @@
 		# ON_ERROR_STOP is used by psql to return non-0 on an error
 		_psql_opts="--no-psqlrc --variable ON_ERROR_STOP=true"
 
-		createdb --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
+		createdb --encoding=UTF-8 --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
 		createlang plpgsql "${DB}" >> ${TMPDIR}/regress_log
 
 		# Count database objects before installing anything
Index: regress/loader/Latin1.select.sql
===================================================================
--- regress/loader/Latin1.select.sql	(revision 9084)
+++ regress/loader/Latin1.select.sql	(working copy)
@@ -1 +1,2 @@
+SET CLIENT_ENCODING to UTF8;
 SELECT * FROM loadedshp;

comment:12 by pramsey, 13 years ago

Resolution: fixed
Status: newclosed

Super, r9088

Note: See TracTickets for help on using tickets.