Opened 17 months ago
Closed 17 months ago
#5397 closed enhancement (fixed)
debug_standardize_address
Reported by: | robe | Owned by: | robe |
---|---|---|---|
Priority: | high | Milestone: | PostGIS 3.4.0 |
Component: | pagc_address_parser | Version: | master |
Keywords: | Cc: |
Description
This is a new function for address_standardizer extension that allows you to debug why an address is standardized the way it is.
It's most useful when abbreviations or words can resolve to more than one kind of token - e.g. ST for Street vs. ST for SAINT.
The way the address standardizer works, it generates a standardized list of all tokens in an address and then applies rules to them. In many cases many rules can result in some answer and the address standardizer picks the one with the highest score.
This function allows you to look into the guts of what it is doing.
I'm still working locally fitting all the bits and am almost there. One thought is if I should expose this as just json text (similar to ST_AsGeoJSON) or as a bonified jsonb or json. I don't like that jsonb looses the sorting of the keys so that is one downside.
I went with JSON as Leo's suggestion cause it allows maximum flexibility in turns of structure without having to change the output of the function.
For example
SELECT s::jsonb FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;
returns at the moment:
{ "rules": [ { "no": 0, "score": 0.87625, "rule_string": "0 1 2 17 0 -1 1 5 6 17 17", "rule_tokens": [ { "pos": 0, "input-token": "NUMBER", "output-token": "HOUSE", "input-token-code": 0, "output-token-code": 1 }, { "pos": 1, "input-token": "WORD", "output-token": "STREET", "input-token-code": 1, "output-token-code": 5 }, { "pos": 2, "input-token": "TYPE", "output-token": "SUFTYP", "input-token-code": 2, "output-token-code": 6 }, { "pos": 3, "input-token": "UNITT", "output-token": "UNITT", "input-token-code": 17, "output-token-code": 17 }, { "pos": 4, "input-token": "NUMBER", "output-token": "UNITT", "input-token-code": 0, "output-token-code": 17 } ] }, { "no": 1, "score": 0.81125, "rule_string": "0 1 2 17 0 -1 1 5 6 17 17", "rule_tokens": [ { "pos": 0, "input-token": "NUMBER", "output-token": "HOUSE", "input-token-code": 0, "output-token-code": 1 }, { "pos": 1, "input-token": "WORD", "output-token": "STREET", "input-token-code": 1, "output-token-code": 5 }, { "pos": 2, "input-token": "TYPE", "output-token": "SUFTYP", "input-token-code": 2, "output-token-code": 6 }, { "pos": 3, "input-token": "UNITT", "output-token": "UNITT", "input-token-code": 17, "output-token-code": 17 }, { "pos": 4, "input-token": "NUMBER", "output-token": "UNITT", "input-token-code": 0, "output-token-code": 17 } ] }, { "no": 2, "score": 0.754167, "rule_string": "0 1 2 17 0 -1 1 5 6 17 17", "rule_tokens": [ { "pos": 0, "input-token": "NUMBER", "output-token": "HOUSE", "input-token-code": 0, "output-token-code": 1 }, { "pos": 1, "input-token": "WORD", "output-token": "STREET", "input-token-code": 1, "output-token-code": 5 }, { "pos": 2, "input-token": "TYPE", "output-token": "SUFTYP", "input-token-code": 2, "output-token-code": 6 }, { "pos": 3, "input-token": "UNITT", "output-token": "UNITT", "input-token-code": 17, "output-token-code": 17 }, { "pos": 4, "input-token": "NUMBER", "output-token": "UNITT", "input-token-code": 0, "output-token-code": 17 } ] }, { "no": 3, "score": 0.735937, "rule_string": "0 1 2 17 0 -1 1 5 6 17 17", "rule_tokens": [ { "pos": 0, "input-token": "NUMBER", "output-token": "HOUSE", "input-token-code": 0, "output-token-code": 1 }, { "pos": 1, "input-token": "WORD", "output-token": "STREET", "input-token-code": 1, "output-token-code": 5 }, { "pos": 2, "input-token": "TYPE", "output-token": "SUFTYP", "input-token-code": 2, "output-token-code": 6 }, { "pos": 3, "input-token": "UNITT", "output-token": "UNITT", "input-token-code": 17, "output-token-code": 17 }, { "pos": 4, "input-token": "NUMBER", "output-token": "UNITT", "input-token-code": 0, "output-token-code": 17 } ] }, { "no": 4, "score": 0.724167, "rule_string": "0 1 2 17 0 -1 1 5 5 17 17", "rule_tokens": [ { "pos": 0, "input-token": "NUMBER", "output-token": "HOUSE", "input-token-code": 0, "output-token-code": 1 }, { "pos": 1, "input-token": "WORD", "output-token": "STREET", "input-token-code": 1, "output-token-code": 5 }, { "pos": 2, "input-token": "TYPE", "output-token": "STREET", "input-token-code": 2, "output-token-code": 5 }, { "pos": 3, "input-token": "UNITT", "output-token": "UNITT", "input-token-code": 17, "output-token-code": 17 }, { "pos": 4, "input-token": "NUMBER", "output-token": "UNITT", "input-token-code": 0, "output-token-code": 17 } ] }, { "no": 5, "score": 0.70625, "rule_string": "0 1 2 2 0 -1 1 5 5 6 17", "rule_tokens": [ { "pos": 0, "input-token": "NUMBER", "output-token": "HOUSE", "input-token-code": 0, "output-token-code": 1 }, { "pos": 1, "input-token": "WORD", "output-token": "STREET", "input-token-code": 1, "output-token-code": 5 }, { "pos": 2, "input-token": "TYPE", "output-token": "STREET", "input-token-code": 2, "output-token-code": 5 }, { "pos": 3, "input-token": "TYPE", "output-token": "SUFTYP", "input-token-code": 2, "output-token-code": 6 }, { "pos": 4, "input-token": "NUMBER", "output-token": "UNITT", "input-token-code": 0, "output-token-code": 17 } ] } ], "input_tokens": [ { "pos": 0, "token": "NUMBER", "stdword": "1", "token-code": 0 }, { "pos": 0, "token": "WORD", "stdword": "1", "token-code": 1 }, { "pos": 1, "token": "WORD", "stdword": "DEVONSHIRE", "token-code": 1 }, { "pos": 2, "token": "TYPE", "stdword": "PLACE", "token-code": 2 }, { "pos": 3, "token": "TYPE", "stdword": "PATH", "token-code": 2 }, { "pos": 3, "token": "UNITT", "stdword": "PENTHOUSE", "token-code": 17 }, { "pos": 4, "token": "NUMBER", "stdword": "301", "token-code": 0 } ] }
Changes I need to make before I commit my work
1) Add an option for the single address version (this will be the same function, but the last will be a default arg, which when left blank will call parse address first (much like current one does when only one arg).
SELECT s::jsonb FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl PH 301, Boston, MA 02109') AS s;
2) Output the macro rules (Not sure why it isn't at moment. This is the one that parses Boston, MA 02109 part 3) Define for each rule what kind of rule it is 4) Finish off the rulestring. Right now the rule_string is missing the type of rule and the weight of the rule to allow it to be easily matched up in the rule table. I also think there might be something wrong with my looping thru the rules since some rules appear to be repeated.
5) Output the results much like what:
SELECT s.* FROM standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;
This will be another array node in the document: where each element with be a jsonified stdaddress
e.g.
"standardized_addresses": [ {"building":null,"house_num":"1","predir":null,"qual":null,"pretype":null,"name":"DEVONSHIRE","suftype":"PLACE","sufdir":null,"ruralroute":null,"extra":null,"city":"BOSTON","state":"MASSACHUSETTS","country":null,"postcode":"02109","box":null,"unit":"# PENTHOUSE 301"} ]
I was thinking of skipping the nulls and eventually adding the lower ranking addresses after.
Change History (9)
comment:1 by , 17 months ago
comment:6 by , 17 months ago
Type: | defect → enhancement |
---|
comment:8 by , 17 months ago
I still need to add in the rule id and full rule string and then also hoping to figure out adding the rules for macro (not sure why they are not coming thru)
In 06bed87/git: