Opened 4 years ago
Closed 4 years ago
#4826 closed defect (fixed)
Geocoder gives goofy resutls for 1 Main St, Hanover, MA
Reported by: | robe | Owned by: | robe |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 3.1.2 |
Component: | tiger geocoder | Version: | 3.1.x |
Keywords: | Cc: |
Description (last modified by )
tiger geocoder as mentioned on IRC
http://irclogs.geoapt.com/postgis/%23postgis.2021-01-05.log
sorry for the repeat from last night, but if anyone's here and knows where I can look for an answer - my TIGER PostGIS setup is returning weird results? when I run geocode('1 Main St, Hanover, MA 02239') it returns a good result, but when I skip the zip code, it returns a result in Andover, which is obviously wrong even though the correct result does exist
One observation is that Hanover is not in the place table, but is in the cousub table.
-- yields right answer SELECT * FROM geocode('1 Main St, Hanover, MA 02339',1); /** (1,,Main,St,,,Hanover,MA,02339,t,,) 0101000020AD1000007FB2EFD615B651C06A1D3D91020F4540 0 **/
-- yields wrong answer SELECT * FROM geocode('1 Main St, Hanover, MA',1); /** (1,E,Main,St,,,Ware,MA,01082,t,,) 0101000020AD1000002FB5DC320F0F52C0E0E8EF375B214540 10 **/
Change History (12)
comment:1 by , 4 years ago
Description: | modified (diff) |
---|
comment:2 by , 4 years ago
comment:3 by , 4 years ago
Okay this is a more generic query that should handle all these. I put in a custom table to distinguish from main so in case it causes undesirable affects you know which ones to monkey with. If this works well I'll incorporate into the main prep logic add add as part of the main state lookups.
It basically adds in all the county sub areas zips that aren't already present. So should handle all cases where the place is a county sub.
CREATE TABLE tiger_data.custom_zip_lookup_base_custom( CONSTRAINT pk_custom_zip_state_loc_city PRIMARY KEY (zip, state, county, city, statefp)) INHERITS (tiger.zip_lookup_base); INSERT INTO tiger_data.custom_zip_lookup_base_custom(zip, state, county, city, statefp) WITH csub AS (SELECT DISTINCT COALESCE(me.zipl, me.zipr) AS zip, s.stusps AS state, c.name As county, cs.name AS city, cs.statefp FROM tiger_data.ma_faces AS f INNER JOIN tiger.county AS c ON (c.countyfp = f.countyfp AND c.statefp = f.statefp) INNER JOIN tiger_data.ma_cousub AS cs ON (f.cousubfp = cs.cousubfp AND cs.countyfp = c.countyfp AND cs.statefp = c.statefp) INNER JOIN tiger.state As s ON s.statefp = f.statefp INNER JOIN tiger_data.ma_edges AS me ON me.tfidl = f.tfid WHERE (me.zipl IS NOT NULL or me.zipr IS NOT NULL) ) SELECT csub.zip, csub.state, csub.county, csub.city, csub.statefp FROM csub LEFT JOIN tiger.zip_lookup_base AS zb ON (csub.city =zb.city AND csub.statefp = zb.statefp AND zb.zip = csub.zip) WHERE zb.zip IS NULL;
comment:4 by , 4 years ago
Thanks for the quick turnaround! I think the city now works correctly, but the result as a whole becomes a mix of the old (incorrect) result and the new one. e.g.
# SELECT ST_X(ST_SnapToGrid(g.geomout, 0.00001)) As longitude, ST_Y(ST_SnapToGrid(g.geomout, 0.00001)) As latitude,addy FROM geocode('1 Main St, Hanover, MA',1) as g; longitude | latitude | addy --------------------+----------+------------------------------------- -70.84437000000001 | 42.15834 | (1,,Main,St,,,Hanover,MA,02061,t,,)
The city looks right, but the coordinates and zip code point to what I'm guessing would be the result if the custom zip lookup table wasn't present - Jacobs Lane, Norwell, MA.
(ST_SnapToGrid also doesn't seem to abbreviate the result properly the way I'm using it, but that's a question I'll address separately)
comment:5 by , 4 years ago
Reverse engineering your solution a bit, I think I see where this might go wrong, though I don't yet have a clear understanding of how to solve it. I think the issue is that cities can have multiple zip codes and having the wrong one for the street can give a result that's in a different city. e.g. for the example we've been looking at, I see this:
# SELECT * FROM zip_lookup_base WHERE city = 'Hanover'; zip | state | county | city | statefp -------+-------+------------+---------+--------- ... Unrelated rows skipped ... 02061 | MA | Plymouth | Hanover | 25 02239 | MA | Plymouth | Hanover | 25 02339 | MA | Plymouth | Hanover | 25
So Hanover MA is partially in 3 different zip codes (and shares 02061 with Norwell, which is the city that has a 1 Main St at that zip code). Hanover's 1 Main St is in 02339 and the correct result is returned if 02339 (or even 02239, which I can't find on a map) is explicitly passed. If 02061 is passed explicitly or implicitly via the new lookup table, the result comes back as Norwell.
I believe this means that you can't have a zip->city lookup with the current geocode function because it will only be correct for some of the addresses. Slow solutions would be to look the address up with each zip code and then return the one with the matching city, or to have a lookup table of city,street->zip. Adjusting geocode() to use the zip code to find the proper city, but then somehow correctly adjust the zip code during the search sounds like a possible solution if the data allows it.
If having a big lookup table ends up taking up 2x the space, that's not a big deal to me as long as it returns the correct results. Likewise, if there's a short-term way to do 2 sequential lookups that would get me the correct result, I could make that work as well, but hopefully you have a better idea of how to wrangle this data to get the desired result.
comment:6 by , 4 years ago
I actually do see a correct result show up in geocode, when I print all the results:
# SELECT rating, ST_X(g.geomout), ST_Y(g.geomout), addy from geocode('1 Main st, Hanover, MA', 100) as g; rating | st_x | st_y | addy --------+--------------------+--------------------+--------------------------------------------- 3 | -70.84580124395251 | 42.11770418442778 | (1,,Main,St,,,Norwell,MA,02339,t,,) 3 | -70.84580124395251 | 42.11770418442778 | (1,,Main,St,,,Hanover,MA,02339,t,,) 3 | -70.84437177401414 | 42.15834177639814 | (1,,Main,St,,,Hanover,MA,02061,t,,) 9 | -70.84437177401414 | 42.15834177639814 | (1,,Main,St,,,Norwell,MA,02061,t,,) 10 | -72.59865296286365 | 42.215393856763214 | (1,N,Main,St,,,Chicopee,MA,01075,t,,) 10 | -72.9734286784644 | 42.49264706267212 | (1,W,Main,St,,,Windsor,MA,01026,t,,) 10 | -72.87441852410113 | 42.50110417638284 | (1,E,Main,St,,,Windsor,MA,01070,t,,) 12 | -72.9734286784644 | 42.49264706267212 | (1,W,Main,St,,,Plainfield,MA,01026,t,,) 12 | -72.87441852410113 | 42.50110417638284 | (1,E,Main,St,,,Plainfield,MA,01070,t,,) 13 | -72.87736680839346 | 42.23484564776451 | (1,E,Main,St,,,Huntington,MA,01050,t,,) 14 | -72.9734286784644 | 42.49264706267212 | (1,W,Main,St,,,Worthington,MA,01026,t,,) 14 | -72.9734286784644 | 42.49264706267212 | (1,W,Main,St,,,Cummington,MA,01026,t,,) 14 | -72.9734286784644 | 42.49264706267212 | (1,W,Main,St,,,Chesterfield,MA,01026,t,,) 15 | -72.59865296286365 | 42.215393856763214 | (1,N,Main,St,,,"South Hadley",MA,01075,t,,) (14 rows)
The the result I want is the 2nd one in this case, though I saw it as the 3rd result when set the second argument of geocode to 3. I think the new lookup table will now give an extra result for any city sharing the related zip code, adding the correct location to the results list, but also making it programmatically indistinguishable from the incorrect result.
comment:7 by , 4 years ago
Can you try this:
DELETE FROM tiger_data.custom_zip_lookup_base_custom WHERE zip IN(SELECT zip FROM tiger.zip_lookup_base WHERE tableoid::regclass::text != 'tiger_data.custom_zip_lookup_base_custom');
and then retest. Hopefully that should get rid of all the ones that are unlikely.
comment:8 by , 4 years ago
@robe, that did get rid of ~580 rows, but the top 4 results are still the 4 variations of correct/incorrect city and zip.
comment:9 by , 4 years ago
hmm which one are you testing with? After I ran the delete and tried this:
SELECT ST_X(ST_SnapToGrid(g.geomout, 0.00001)) As longitude, ST_Y(ST_SnapToGrid(g.geomout, 0.00001)) As latitude,addy FROM geocode('1 Main St, Hanover, MA',1) as g;
outputs
-70.84508000000001 42.117270000000005 (1,,Main,St,,,Hanover,MA,02339,t,,)
SELECT ST_X(ST_SnapToGrid(g.geomout, 0.00001)) As longitude, ST_Y(ST_SnapToGrid(g.geomout, 0.00001)) As latitude,addy FROM geocode('1 Main St, Hanover, MA 02339',1) as g;
outputs:
-70.84508000000001 42.117270000000005 (1,,Main,St,,,Hanover,MA,02339,t,,)
comment:10 by , 4 years ago
Try replacing your geocode_address with this one - make sure to prefix with tiger.
[ba2f659c001142687caf05a3999501ecd37e307a/git]
With the above patch, my results for:
SELECT rating, ST_X(g.geomout), ST_Y(g.geomout), addy from geocode('1 Main st, Hanover, MA', 100) as g;
Look like:
3 -70.84508298309264 42.11726584896208 (1,,Main,St,,,Hanover,MA,02339,t,,) 9 -70.84437177401415 42.15834177639816 (1,,Main,St,,,Norwell,MA,02061,t,,) 10 -72.23530265382281 42.260596267836945 (1,E,Main,St,,,Ware,MA,01082,t,,) 11 -72.79730604439531 42.43541221562752 (1,S,Main,St,,,Goshen,MA,01096,t,,) 12 -72.87424511880401 42.500985260983285 (1,E,Main,St,,,Plainfield,MA,01070,t,,) 14 -72.40080978104714 42.276919680561925 (1,S,Main,St,,,Belchertown,MA,01007,t,,) 14 -72.65981893720061 42.328655966035754 (1,S,Main,St,,,Northampton,MA,01062,t,,) 14 -72.40355779373047 42.27997542236744 (1,N,Main,St,,,Belchertown,MA,01007,t,,) 14 -72.3321618003692 42.18801537873117 (1,S,Main,St,,,"Palmer Town",MA,01079,t,,) 15 -72.59183570986102 42.214047537431945 (1,S,Main,St,,,"South Hadley",MA,01075,t,,) 15 -72.73052592739397 42.39331433997838 (1,N,Main,St,,,Williamsburg,MA,01096,t,,) 15 -72.59865296286365 42.215393856763214 (1,N,Main,St,,,"South Hadley",MA,01075,t,,) 21 -70.75985323124343 42.16990860652304 (,,Main,St,,,Scituate,MA,02061,t,,)
comment:11 by , 4 years ago
Milestone: | PostGIS 3.1.1 → PostGIS 3.1.2 |
---|
Adding the following fixes the issue. Main St:
now yields right answer: