Opened 13 years ago

Closed 7 years ago

#1118 closed defect (duplicate)

Street pretypes that aren't directional prefixes

Reported by: robe
Owned by: robe
Priority: medium
Milestone: PostGIS Fund Me
Component: tiger geocoder
Version: 1.5.X
Keywords:
Cc: jmarca, woodbri

Description (last modified by robe)

Example:

Camino del Rio is an example as noted in: http://www.postgis.org/pipermail/postgis-users/2011-July/030152.html

normalize_address doesn't handle these at all and tiger data is inconsistent as to how it represents these.

So for example:

In tiger: Camino del Rio is represented as having

fullname: Cam del Rio

name: del Rio

and pretypabrv: Cam

But in other locations you'll see it listed as

fullname: Cam del Rio

no pretypabrv

There are two ways to handle this:

(1) Introduce a new field in the norm_addy object, the pretypabbrv, plus another table to store these alternative-spelling lookups, and change both geocode and normalize_address to handle the new field.

OR (2), which I am leaning toward:

Have a pretypeabbrv lookup and, during the geocode process only, compare tiger street names against all normalized street name forms (i.e. replace any of these words with its abbreviated form when it appears at the beginning of the street name, then compare against the normalize_address output).

Option 2 seems much easier to implement and will also handle many cases such as the case of St. vs. Saint.
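
A rough sketch of what option 2 could look like, written in Python rather than PL/pgSQL so it runs on its own. The lookup contents and the function names (`street_name_forms`, `names_match`) are hypothetical; only the abbreviations Cam, Pso and Cll come from examples in this ticket, and punctuation such as "St." is not handled.

```python
# Hypothetical lookup: full pretype word -> abbreviated form seen in tiger.
# Only a handful of sample entries; a real table would be much larger.
PRETYPE_ABBREV = {
    "camino": "Cam",
    "paseo": "Pso",
    "calle": "Cll",
    "saint": "St",
}


def street_name_forms(street_name, lookup=PRETYPE_ABBREV):
    """Return the set of comparable (lowercased) forms of a street name.

    The name as given is always included. If its first word is a known
    pretype, the form with that word abbreviated is included as well.
    """
    words = street_name.split()
    forms = {" ".join(words).lower()}
    if words:
        abbrev = lookup.get(words[0].lower())
        if abbrev is not None:
            forms.add(" ".join([abbrev] + words[1:]).lower())
    return forms


def names_match(tiger_fullname, input_street):
    """True if any form of the input equals any form of the tiger name."""
    return bool(street_name_forms(tiger_fullname) & street_name_forms(input_street))


print(names_match("Cam del Rio", "Camino del Rio"))    # True
print(names_match("Pso del Cajon", "PASEO DEL CAJON"))  # True
print(names_match("Cam del Rio", "del Rio"))            # False
```

The point of the design is that norm_addy stays unchanged: the extra forms exist only at comparison time.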

Example to test:

So now:

SELECT pprint_addy(addy), ST_AsText(geomout), rating
FROM geocode('477 Camino del Rio South, San Diego, CA 94115');

Attachments (1)

census_abbreviations.csv (14.4 KB ) - added by jmarca 12 years ago.

Change History (14)

comment:1 by robe, 13 years ago

Description: modified (diff)

Actually, highways are on this list, so I have partially taken care of this in r7641 by gluing the pretype onto the front. For things like Camino, the easiest solution is probably to mark them as highways. The only issue I have now is that if I add Camino to the list of street types and mark it as a highway, my normalizer cuts off the Rio. This is yet another bug in the normalizer which I have to address.

comment:2 by robe, 13 years ago

More examples of this issue:

1798 PASEO DEL CAJON, PLEASANTON, CA 94566 

Doesn't geocode correctly, but the following does: it returns the right answer, though with a bad rating.

SELECT pprint_addy(addy), ST_AsText(ST_SnapToGrid(geomout,0.0001)) As coord, rating 
	FROM geocode('1798 PSO DEL CAJON, PLEASANTON, CA 94566',1) As g;

The behavior of pretypabrv values is a little different from highways. Both require fixes in normalize_address, so I'll probably add another column in street_typelookup to handle these. Most often, a street with a pretype doesn't also have a street type at the end, though in some cases it can have both.

comment:3 by robe, 13 years ago

refer to #1148 for more examples

comment:4 by robe, 13 years ago

Milestone: PostGIS 2.0.0 → PostGIS Future

comment:5 by jmarca, 12 years ago

Cc: jmarca added

California examples:

Wrong answer, right question:

select pprint_addy(addy), st_astext(geomout),rating FROM  geocode( 'Via Verde, Dana Point CA');
pprint_addy | st_astext | rating
Via Verde Ct, Calabasas, CA 91302 | POINT(-118.659995686466 34.1275841694006) | 41

Right answer, wrong question:

select pprint_addy(addy), st_astext(geomout),rating FROM  geocode( 'Verde Via, Dana Point CA');
pprint_addy | st_astext | rating
Verde Via, Dana Point, CA 92624 | POINT(-117.672816628784 33.4623777015046) | 38

Wrong answer, right question:

select pprint_addy(addy), st_astext(geomout),rating FROM  geocode( 'Camino Las Ramblas, San Juan Capistrano CA');
pprint_addy | st_astext | rating
Lago Cll, Dana Point, CA 92624 | POINT(-117.665451952078 33.4629500870658) | 60
Lago Cll, San Clemente, CA 92672 | POINT(-117.629773684804 33.4331286117088) | 68

...

Right answer, wrong question:

select pprint_addy(addy), st_astext(geomout),rating FROM  geocode( 'Las Ramblas Camino, San Juan Capistrano CA');
pprint_addy | st_astext | rating
Cam Las Ramblas, San Juan Capistrano, CA 92675 | POINT(-117.662978341711 33.4686616216608) | 38
Las Ramblas Dr, Concord, CA 94521 | POINT(-121.9494654272 37.9565518802366) | 53

Variations on how Census treats text with "Las Ramblas"

select distinct fullname from addrfeat where fullname ~* 'Las Ramblas' limit 10;
fullname
Cam Las Ramblas
Via Las Ramblas
Cll Las Ramblas
Ave Las Ramblas
Las Ramblas Dr
Las Ramblas

by jmarca, 12 years ago

Attachment: census_abbreviations.csv added

comment:6 by woodbri, 12 years ago

Cc: woodbri added

Regina,

These name transformations between full word and abbreviation should be handled in a lexicon that is then used to classify and standardize the various tokens of the street name. Then cases like these can be fixed by adding more entries to the lexicon rather than by changing code.
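
To make the idea concrete, here is a toy, lexicon-driven token classifier in Python. The three-column layout and the class names (TYPE, DIRECT, WORD) are invented for illustration and are not the actual PAGC lexicon.csv format; `load_lexicon` and `classify` are hypothetical helpers.

```python
import csv
import io

# Hypothetical lexicon: word, standard form, token class.
# The layout and class names are made up; this is NOT PAGC's real format.
LEXICON_CSV = """\
word,standard,kind
camino,CAM,TYPE
cam,CAM,TYPE
paseo,PSO,TYPE
pso,PSO,TYPE
south,S,DIRECT
s,S,DIRECT
"""


def load_lexicon(text):
    """Build {word: (standard, kind)} from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["word"]: (row["standard"], row["kind"]) for row in reader}


def classify(street, lexicon):
    """Return (standard_token, kind) pairs; unknown words get kind WORD."""
    result = []
    for token in street.lower().split():
        standard, kind = lexicon.get(token, (token.upper(), "WORD"))
        result.append((standard, kind))
    return result


lexicon = load_lexicon(LEXICON_CSV)
print(classify("Camino del Rio South", lexicon))
# [('CAM', 'TYPE'), ('DEL', 'WORD'), ('RIO', 'WORD'), ('S', 'DIRECT')]

# Supporting a new pretype is a data change, not a code change:
lexicon["calle"] = ("CLL", "TYPE")
print(classify("Calle Lago", lexicon))
# [('CLL', 'TYPE'), ('LAGO', 'WORD')]
```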

The PAGC tools have both a lexicon for classifying and standardizing street name components and a gazeteer for doing the same to city and state components. There is also a lexicon.csv and gazeteer.csv in that PAGC svn.

comment:7 by robe, 12 years ago

Component: tiger geocoder → pagc_address_parser

comment:8 by woodbri, 12 years ago

This is a problem of not standardizing the reference dataset and relying on the existing standardization instead. This is a process bug, not a code bug. If you take a random address and ask some people to standardize it into components, you will surely get different results, because each person will have a different set of rules in mind. Tiger data has been standardized by 3300 different counties, where it was collected and given to Census, so you will not even find consistency within Tiger. Relying on the pre-parsed standardization is therefore the wrong way to approach this problem.

The way to fix this is to load the tiger data, then clump the name attributes into a single string and give it to the standardizer to parse and then save that. When we get a query request, we standardize that using our same standardizer and rules and we match those results against our standardized reference set.

Then we don't care if the standardization is right or wrong, because if it is wrong, it will be wrong in both cases and will still match.
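
A minimal Python sketch of that process, under stated assumptions: `standardize` is a toy stand-in (not the PAGC standardizer), and the reference rows with their `(row_id, pretype, name, suffix_dir)` layout are hypothetical. It shows two inconsistently split reference rows collapsing to the same key once the name attributes are clumped and re-parsed.

```python
def standardize(text):
    """Toy standardizer (an assumption, not PAGC): uppercase, drop commas
    and periods, and apply a tiny abbreviation map to every token."""
    abbrev = {"CAMINO": "CAM", "PASEO": "PSO", "SOUTH": "S"}
    tokens = text.upper().replace(",", " ").replace(".", " ").split()
    return " ".join(abbrev.get(t, t) for t in tokens)


def build_index(reference_rows):
    """Clump each row's name attributes into one string, standardize it,
    and index the row ids by the standardized result."""
    index = {}
    for row_id, pretype, name, suffix_dir in reference_rows:
        clumped = " ".join(part for part in (pretype, name, suffix_dir) if part)
        index.setdefault(standardize(clumped), []).append(row_id)
    return index


# Hypothetical reference rows with the two inconsistent representations.
REFERENCE = [
    (1, "Cam", "del Rio", "S"),     # pretype split into its own column
    (2, None, "Cam del Rio", "S"),  # pretype left inside the name
    (3, "Pso", "del Cajon", None),
]

index = build_index(REFERENCE)

# The query goes through the same standardizer as the reference data.
print(index.get(standardize("Camino del Rio South"), []))  # [1, 2]
print(index.get(standardize("Paseo del Cajon"), []))       # [3]
```

Because both sides pass through the same function, the way the source split the name into columns stops mattering.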

This process also has the benefit that you can analyze the records that failed to standardize because of missing lexicon, gazeteer or rules entries, and add whatever is needed to improve the tools over time. This part can be done separately from the automated loading process. It should be done as part of the bug fixing and enhancements to the geocoder over time.

The pagc address standardizer improves things and provides some easy tools to change the behavior, but if you don't make this process change you will have an endless list of bugs like this that have nothing to do with the code. While you might be able to fix some of these with changes to lex, gaz and rules, you also might be breaking other cases that are not obvious when you make changes. DAMHIK.

I know the plan is to move forward without making this process change, but it should be planned for sometime in the future.

comment:9 by robe, 12 years ago

Yah, I was thinking of it for the future. I'll ticket that. I'm leaning toward using hstore to store the normalized hash for the tiger set, possibly only doing it for the obvious ambiguities.

The issues I have with doing it after load and for all records:

1) Inserting is a lot less painful than updating, since updating requires both an insert and a delete. So it's faster to do on load.

2) Since this is in flux, there'll be a lot of updating going on initially, so I don't want to push that on users until things are more stable. It also complicates the update script, with updates requiring user data changes, which is something I want to stay away from until my upgrade is bulletproof.

3) I actually don't think it's necessary to standardize all of tiger (I would say about 85% or more of it is fine). For the most part there aren't that many ambiguities, and a lot of those would be long and painful to itemize, so doing it by lex is probably not the right way.

Clearly for things like Camino etc that would be the right thing.

So I'm thinking more along the lines of a hybrid. It would also make my hstore index way shorter and faster to scan if only the questionable, problematic ones need to be changed. Anyway, I'll put in a separate future ticket. For PostGIS 2.1 I would like to change the norm_addy structure, since part of the issue is that I am mixing pre abbrevs with post abbrevs.

comment:10 by robe, 12 years ago

Component: pagc_address_parser → tiger geocoder

Switching back to a tiger problem, since it's more an issue of how the normalizer is used than of which one.

comment:11 by woodbri, 12 years ago

I have moved the design discussion into #2289

comment:12 by robe, 7 years ago

Milestone: PostGIS Future → PostGIS Fund Me

Milestone renamed

comment:13 by robe, 7 years ago

Resolution: duplicate
Status: new → closed