Opened 13 years ago
Closed 13 years ago
#1382 closed defect (fixed)
Some addresses take a long time to geocode or normalize
Reported by: | raviada | Owned by: | robe |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.0.0 |
Component: | tiger geocoder | Version: | master |
Keywords: | Cc: |
Description
Regina,
Thanks for helping me on this, normalize_address return slow but when I turn in the debug on, the out seems normal, because the address was parsed correctly and all fields got populated with the right address components.
I am attaching some of the address here which were taking a little longer. This is the output from postgis_full_version().
"POSTGIS="2.0.0SVN" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009" LIBXML="2.7.8" USE_STATS"
Thanks Ravi Ada
4770 RTE. 209, ACCORD, NY 12404 509 METTACAHONTS ROAD, ACCORD, NY 12404 13859 PROGRESS BLVD STE 100, ALACHUA, FL 326159403 17330 113TH AVE, ADDISLEIGH PARK, NY 114334003 179TH ST, ADDISLEIGH PARK, NY 114341413 3510 HWY 186, ALIX, AR 728200061 1250 S 18TH ST, AMELIA ISLAND, FL 320341902 4780 NE 105TH PL, ANTHONY, FL 326173156 233 ROUTE 6, COLUMBIA, CT 062371125 11623 NE 11TH PL, BISCAYNE PARK, FL 331616725 2410 HIGHWAY 65 N, MC GEHEE, AR 716549437 5430 NW 33RD AVE, FT LAUDERDALE, FL 333096349
Attachments (1)
Change History (12)
by , 13 years ago
Attachment: | sample_addresses.txt added |
---|
comment:1 by , 13 years ago
Status: | new → assigned |
---|---|
Version: | 1.5.X → trunk |
follow-up: 6 comment:2 by , 13 years ago
comment:3 by , 13 years ago
Thanks Regina for looking into this. Appreciate your help. I checked the normalize_address.sql and found out this. Seems like I have the older version. What is your recommendation on upgrading all of the functions and scripts, just in case other scripts may have been older versions too.
--$Id: normalize_address.sql 7801 2011-09-01 14:58:04Z robe $-
Thanks Ravi Ada
comment:5 by , 13 years ago
Ravi, 1) Download the latest postgis 2.0 tar ball or if you have svn svn update your postgis. 2) I fyou are on linux/unix edit and run the tiger_2010/upgrade_geocoder.sh, for windows there is a similar upgrade_geocoder.bat file.
Then you should be up to date.
Brian, Yes I was using mostly limit 1 but not all the time. Limit 1 tends to perfrom better than all but it varies a lot how much better it is but the speed is usually equal to or better.
comment:6 by , 13 years ago
Regina, I upgraded to the new build which is Revision 8450. the output from postgis_full_version is "POSTGIS="2.0.0SVN" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009" LIBXML="2.7.8" USE_STATS"
Also I see --$Id: normalize_address.sql 8252 2011-11-29 08:49:06Z robe $- in normalize_address function.
select geocode('2601 24TH AVE, ASTORIA, NY 111022337'); returns two rows Total query runtime: 20943 ms. 2 rows retrieved.
"("(24th,Ave,""New York"",NY,11204,t)",0101000020AD10000036CF279D807E52C0511C69D0424E4440,21)" "("(24th,Ave,""New York"",NY,11214,t)",0101000020AD100000FA12A7B3417F52C0699E6C55CF4C4440,21)"
I think it is still a problem in using the geocode function. normalize function is now returning real fast, in 30-60 ms. Geococe is the one taking 20 secs. All NY addresses are like this.
Thanks Ravi Ada
comment:7 by , 13 years ago
select geocode_address(normalize_address('2734 21ST ST, ASTORIA, NY 111023653'),1)
"("(2734,E,21st,St,""New York"",NY,11235,t)",0101000020AD100000097F608BA97C52C06D6463B6F24A4440,13)" "("(273421st,St,""Niagara Falls"",NY,14305,t)",0101000020AD100000625E1B7D62C253C06AD712AC798E4540,17)" "("(21st,St,Troy,NY,12180,t)",0101000020AD100000AA30B0E8D16A52C09CD542B2EC5E4540,18)"
NOTICE: Bad zip newzip range: '{}' NOTICE: Ignore new zip range that is bad too: '{}' NOTICE: Zip range based on only considering city: '{}' NOTICE: stmt: WITH a AS
( SELECT *
FROM (SELECT f.*, ad.side, ad.zip, ad.fromhn, ad.tohn,
RANK() OVER(ORDER BY diff_zip(ad.zip,'111023653') + CASE WHEN lower(f.name) = lower('21ST') THEN 0 ELSE levenshtein_ignore_case(f.name, lower('21ST') ) END +
levenshtein_ignore_case(f.fullname, lower('21ST' ' ' COALESCE('St',)) ) + CASE WHEN (greatest_hn(ad.fromhn,ad.tohn) % 2)::integer = ('2734' % 2)::integer THEN 0 ELSE 1 END + CASE WHEN '2734'::integer BETWEEN least_hn(ad.fromhn,ad.tohn) AND greatest_hn(ad.fromhn, ad.tohn)
THEN 0 ELSE 4 END + CASE WHEN lower('St') = lower(f.suftypabrv) OR lower('St') = lower(f.pretypabrv) THEN 0 ELSE 1 END + rate_attributes(NULL, f.predirabrv, '21ST', f.name , 'St', suftypabrv , NULL, sufdirabrv, prequalabr) )
As rank
FROM featnames As f INNER JOIN addr As ad ON (f.tlid = ad.tlid)
WHERE '36' = f.statefp AND '36' = ad.statefp
AND lower(f.name) = lower('21ST') AND ( ad.zip = ANY('{}'::varchar[]) ) ) AS foo ORDER BY rank LIMIT 3 )
SELECT * FROM (
SELECT DISTINCT ON (sub.predirabrv,sub.fename,COALESCE(sub.suftypabrv, sub.pretypabrv) ,sub.sufdirabrv,sub.place,s.stusps,sub.zip) sub.predirabrv as fedirp, sub.fename, COALESCE(sub.suftypabrv, sub.pretypabrv) as fetype, sub.sufdirabrv as fedirs, sub.place , s.stusps as state, sub.zip as zip, interpolate_from_address('2734', sub.fromhn, sub.tohn, sub.the_geom, sub.side) as address_geom, sub.sub_rating + least(coalesce(diff_zip('111023653' , sub.zip),0), 10)::integer + coalesce(levenshtein_ignore_case('ASTORIA', sub.place),5) as sub_rating, sub.exact_address as exact_address, sub.tohn, sub.fromhn FROM ( SELECT tlid, predirabrv, COALESCE(b.prequalabr ' ', ) b.name As fename, suftypabrv, sufdirabrv, fromhn, tohn, side, zip, rate_attributes(NULL, predirabrv, '21ST', b.name , 'St', suftypabrv , NULL, sufdirabrv, prequalabr) + CASE WHEN '2734'::integer IS NULL OR b.fromhn IS NULL THEN 20 WHEN '2734'::integer >= least_hn(b.fromhn, b.tohn) AND '2734'::integer <= greatest_hn(b.fromhn,b.tohn) AND ('2734'::integer % 2) = (to_number(b.fromhn,'99999999') % 2)::integer THEN 0 WHEN '2734'::integer >= least_hn(b.fromhn,b.tohn) AND '2734'::integer <= greatest_hn(b.fromhn,b.tohn) THEN 2 ELSE ((1.0 - (least_hn('2734'::text,least_hn(b.fromhn,b.tohn)::text)::numeric / (greatest(1,greatest_hn('2734'::text,greatest_hn(b.fromhn,b.tohn)::text))) )) * 5)::integer + 5 END as sub_rating,'2734'::integer >= least_hn(b.fromhn,b.tohn) AND '2734'::integer <= greatest_hn(b.fromhn,b.tohn) AND ('2734' % 2)::numeric::integer = (to_number(b.fromhn,'99999999') % 2) as exact_address, b.name, b.prequalabr, b.pretypabrv, b.tfidr, b.tfidl, b.the_geom, b.place FROM
(SELECT a.tlid, a.fullname, a.name, a.predirabrv, a.suftypabrv, a.sufdirabrv, a.prequalabr, a.pretypabrv,
b.the_geom, tfidr, tfidl, a.side , a.fromhn, a.tohn, a.zip, p.name as place
FROM a INNER JOIN edges As b ON (a.statefp = b.statefp AND a.tlid = b.tlid )
INNER JOIN faces AS f ON ('36' = f.statefp AND ( (b.tfidl = f.tfid AND a.side = 'L') OR (b.tfidr = f.tfid AND a.side = 'R' ) )) INNER JOIN place p ON ('36' = p.statefp AND f.placefp = p.placefp )
WHERE a.statefp = '36' AND b.statefp = '36'
) As b
ORDER BY 10 , 11 DESC LIMIT 20
) AS sub
JOIN state s ON ('36' = s.statefp)
ORDER BY 1,2,3,4,5,6,7,9
LIMIT 20) As foo ORDER BY sub_rating, exact_address DESC LIMIT 1
' ', ) | a.name As fename, suftypabrv, sufdirabrv, fromhn, tohn, |
side, a.statefp, zip, rate_attributes($5, a.predirabrv, $2, a.name , $4, a.suftypabrv , $6, a.sufdirabrv, a.prequalabr) + CASE WHEN $1::integer IS NULL OR b.fromhn IS NULL THEN 20 WHEN $1::integer >= least_hn(b.fromhn, b.tohn) AND $1::integer <= greatest_hn(b.fromhn,b.tohn) AND ($1::integer % 2) = (to_number(b.fromhn,'99999999') % 2)::integer THEN 0 WHEN $1::integer >= least_hn(b.fromhn,b.tohn) AND $1::integer <= greatest_hn(b.fromhn,b.tohn) THEN 2 ELSE ((1.0 - (least_hn($1::text,least_hn(b.fromhn,b.tohn)::text)::numeric / greatest(1,greatest_hn($1::text,greatest_hn(b.fromhn,b.tohn)::text)))) * 5)::integer + 5 END as sub_rating,$1::integer >= least_hn(b.fromhn,b.tohn) AND $1::integer <= greatest_hn(b.fromhn,b.tohn) AND ($1 % 2)::numeric::integer = (to_number(b.fromhn,'99999999') % 2) as exact_address, a.name, a.prequalabr, a.pretypabrv FROM featnames a join addr b ON (a.tlid = b.tlid AND a.statefp = b.statefp ) WHERE a.statefp = '36' AND a.mtfcc LIKE 'S%' AND ( soundex($2) = soundex(a.name) OR ( (length($2) > 15 or (length($2) > 7 AND a.prequalabr > ) ) AND lower(a.fullname) LIKE lower(substring($2,1,15)) '%' ) OR numeric_streets_equal($2, a.name) ) ORDER BY 11 LIMIT 20 ) AS sub JOIN edges e ON ('36' = e.statefp AND sub.tlid = e.tlid AND e.mtfcc LIKE 'S%' ) JOIN state s ON ('36' = s.statefp) JOIN faces f ON ('36' = f.statefp AND (e.tfidl = f.tfid OR e.tfidr = f.tfid)) LEFT JOIN zip_lookup_base zip ON (sub.zip = zip.zip AND zip.statefp='36') LEFT JOIN place p ON ('36' = p.statefp AND f.placefp = p.placefp) LEFT JOIN county co ON ('36' = co.statefp AND f.countyfp = co.countyfp) LEFT JOIN cousub cs ON ('36' = cs.statefp AND cs.cosbidfp = sub.statefp co.countyfp f.cousubfp) WHERE ( (sub.side = 'L' and e.tfidl = f.tfid) OR (sub.side = 'R' and e.tfidr = f.tfid) ) ORDER BY 1,2,3,4,5,6,7,9 LIMIT 10
' ', ) | a.name As fename, suftypabrv, sufdirabrv, fromhn, tohn, |
side, a.statefp, zip, rate_attributes($5, a.predirabrv, $2, a.name , $4, a.suftypabrv , $6, a.sufdirabrv, a.prequalabr) + CASE WHEN $1::integer IS NULL OR b.fromhn IS NULL THEN 20 WHEN $1::integer >= least_hn(b.fromhn, b.tohn) AND $1::integer <= greatest_hn(b.fromhn,b.tohn) AND ($1::integer % 2) = (to_number(b.fromhn,'99999999') % 2)::integer THEN 0 WHEN $1::integer >= least_hn(b.fromhn,b.tohn) AND $1::integer <= greatest_hn(b.fromhn,b.tohn) THEN 2 ELSE ((1.0 - (least_hn($1::text,least_hn(b.fromhn,b.tohn)::text)::numeric / greatest(1,greatest_hn($1::text,greatest_hn(b.fromhn,b.tohn)::text)))) * 5)::integer + 5 END as sub_rating,$1::integer >= least_hn(b.fromhn,b.tohn) AND $1::integer <= greatest_hn(b.fromhn,b.tohn) AND ($1 % 2)::numeric::integer = (to_number(b.fromhn,'99999999') % 2) as exact_address, a.name, a.prequalabr, a.pretypabrv FROM featnames a join addr b ON (a.tlid = b.tlid AND a.statefp = b.statefp ) WHERE a.statefp = '36' AND a.mtfcc LIKE 'S%' AND ( soundex($2) = soundex(a.name) OR ( (length($2) > 15 or (length($2) > 7 AND a.prequalabr > ) ) AND lower(a.fullname) LIKE lower(substring($2,1,15)) '%' ) OR numeric_streets_equal($2, a.name) ) ORDER BY 11 LIMIT 20 ) AS sub JOIN edges e ON ('36' = e.statefp AND sub.tlid = e.tlid AND e.mtfcc LIKE 'S%' ) JOIN state s ON ('36' = s.statefp) JOIN faces f ON ('36' = f.statefp AND (e.tfidl = f.tfid OR e.tfidr = f.tfid)) LEFT JOIN zip_lookup_base zip ON (sub.zip = zip.zip AND zip.statefp='36') LEFT JOIN place p ON ('36' = p.statefp AND f.placefp = p.placefp) LEFT JOIN county co ON ('36' = co.statefp AND f.countyfp = co.countyfp) LEFT JOIN cousub cs ON ('36' = cs.statefp AND cs.cosbidfp = sub.statefp co.countyfp f.cousubfp) WHERE ( (sub.side = 'L' and e.tfidl = f.tfid) OR (sub.side = 'R' and e.tfidr = f.tfid) ) ORDER BY 1,2,3,4,5,6,7,9 LIMIT 10
NOTICE: EXECUTE query_base_geo(2734,'21ST','ASTORIA','St',NULL,NULL,'111023653',NULL); NOTICE: DEALLOCATE query_base_geo; Total query runtime: 14172 ms. 3 rows retrieved.
comment:8 by , 13 years ago
Ravi - what sort of Shared Buffer settings do you have in postgresql.conf? working memory ? The default Postgres settings are very minimal
comment:9 by , 13 years ago
Ravi,
Okay that seems to agree with around what I am getting for that address. I get 19 seconds for no limit and 18 seconds for limit. But the address it returns is completely wrong. I'll put in a separate ticket for this. I think it might be the Astoria throwing it off.
since if I run this:
select pprint_addy(addy), ST_AsText(geomout), rating FROM geocode('2601 24TH AVE, NY 111022337',2); -- it returns this: 0 24th Ave, New York, NY 11102 POINT(-73.9184479285714 40.7744440102041) 15
Which is much more accurate.
If I reverse geocode:
-- answer tiger geocoder is returning (with Astoria) SELECT pprint_addy(addy[1]), array_to_string(street, '|') from reverse_geocode('POINT(-73.9750505 40.613036)', true) As g; -- 6148 24th Ave, NY 11204 1404 - 1416 Dahill Rd|2301 - 2399 62nd St
-- to rule out a tiger data issue, I reverse geocoded what google returns. -- though the place seems suspiciously missing, thought that could be a reverse geocode issue.
SELECT pprint_addy(addy[1]), array_to_string(street, '|') As cross_streets from reverse_geocode('POINT(-73.918402 40.774441)', true) As g; --- 2607 24th Ave, NY 11102 2301 - 2399 26th St|2300 - 2398 27th St
comment:11 by , 13 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
I'm closing this ticket out and continuing on #1384.
Ravi, Thanks for the examples. I'm still analyzing these. The ones I've tested the normalize_address functions on returned fairly fast. All under 7ms. So I'm not seeing a normalize speed issue here. If you are you might be running an older version of the geocoder. In the normalize_address code you should see this:
That is the revision number of the latest normalizer. If you don't have it at all or an older number yours is out of date.
1) 179TH ST, ADDISLEIGH PARK, NY 114341413, I think I can improve on the normalizing logic and I might have a ticket for it already.
is just normalizing incorrectly so its putting 179 in the street number and ST as the street name. This is because this address has no street number. So that is why that one is so slow and comes back with wrong answer. It's still a bit slow on my test box took about 15 secs for 2 reasons
When I do this:
a) the street name is short b) I didn't specify a valid street range b) and ADDISLEIGH PARK doesn't match anything in tiger.
2) This one 509 METTACAHONTS ROAD, ACCORD, NY 12404 -- runs fairly fast on my box take 90ms to geocode returning
So the speeds are pretty decent though the address doesn't match. I suspect this is more of a tiger data issue than logic issue. The fact it gives different addresses between limit 1 and none is that to improve speed I have inner limit limitting as well but if there is no perfect match or close to perfect match you run the issue of the gvie me one answer returning slightly worse than the full. I'm not sure there is much I can do about that without compromising speed and the benefit is low.
3) This one select pprint_addy(addy), rating, ST_AsText(geomout)
Did take 20,483 ms and came back with only the street.
The reason is because the address for this is really: 173-30
and our geocode doesn't support that kind of street number yet. It would require the same structural changes as #886. I'll see what I can do about it though as a lot of NY addresses will have this issue. But it wouldn't help you much since you don't have the - in your address.
-- Note to Steve Woodbridge: Would your C normalizer help in this case? If we were to embed it in?