#2193 closed enhancement (fixed)
Utilize PAGC parser as drop in replacement for tiger normalizer
Reported by: | robe | Owned by: | robe |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 2.1.0 |
Component: | pagc_address_parser | Version: | master |
Keywords: | history | Cc: | woodbri |
Description
Now that we have the PAGC parser working with PostgreSQL 9.2 (still have to test 9.1) and also working under my development environment, we can now explore using the PAGC address normalizer as a drop-in replacement for the built-in tiger normalizer.
The plan is this:
Check whether the address_standardizer extension is installed. It can be installed with the command:
CREATE EXTENSION address_standardizer;
If it is installed, use it instead of the packaged tiger geocoder normalize_address function to normalize the addresses.
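One way to perform the check described above is to query the `pg_extension` system catalog. This is only a minimal sketch; how the geocoder then switches normalizers is not shown in this ticket, and the output column name is made up for illustration.

```sql
-- Returns true when the address_standardizer extension is installed
-- in the current database, false otherwise.
-- has_address_standardizer is a hypothetical alias, not an existing setting.
SELECT EXISTS (
    SELECT 1
    FROM pg_extension
    WHERE extname = 'address_standardizer'
) AS has_address_standardizer;
```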
I'm planning to have this packaged in winnie interim windows builds.
Some concerns/suggestions
1) It seems I have to package the pcre*.dll files along with it. It would be nice if I could just statically compile this into address_standardizer.dll. I hate having all these extra children tagging along.
2) I'm tempted to say we should just include the rules.sql, lexicon.sql, gazeteer.sql as part of the extension, perhaps with another field to denote which records are packaged, so that users can customize and ensure their custom records are backed up.
I just think having to install 3 extra files ruins the beauty of an extension.
Change History (20)
comment:1 by , 12 years ago
comment:2 by , 12 years ago
Yah, after I made that comment I was thinking the same thing: those may differ depending on locality and might be better packaged as a separate extension, or as a configuration module that you load with a function in the address_standardizer extension to denote which one to use, similar to the Full Text dictionaries.
Then each can be packaged as an extension, as you have it, similar to the FText dictionaries here: http://pgxn.org/tag/dictionary/
comment:3 by , 12 years ago
Steve,
As a general note, I'm hoping to have this integrated into the 2.1 tiger geocoder as an option. I guess we'll need to decide how the code base will work. I'm indifferent. 1) Do we just point people at your repo to download the address_standardizer, or do we incorporate that piece in PostGIS as an extra (in the extras folder), separate from but related to tiger_geocoder?
2) If it's an extra, I would document it in the PostGIS docs as a separate extra usable with the tiger geocoder, with its own dedicated section in the docs.
If separate -- I'd have it as a subsection in the tiger geocoder docs -- where to download, install, and basic instructions for using it with the tiger geocoder, plus a link to get more info.
Similar to how we do libxml, json-c
thoughts?
comment:4 by , 12 years ago
Regina,
I think initially we should just create a directory in your source tree and put the files there. While I have created a branch in PAGC where I developed this code, I have not had time to review the code with Walter to figure out how it should be packaged or if it will even stay in the current form. I think that since PostGIS is the only user of this at the moment and it needs to be in a form suitable for including in PostGIS, this makes the most sense for now. In the future, as things change, we can make some adjustments. In addition to this, Walter and I want to look at refactoring all of PAGC because there may be other useful bits that we can extract in a similar way. For now you need a stable fork, so just keep it in its own directory, and we should be able to pull changes out of a similar directory that I will maintain in a PAGC branch until we get our end sorted.
Let me know if this makes sense for you.
comment:5 by , 12 years ago
Yes, makes sense. As far as packaging the tables, I think I'll go with a separate extension, but perhaps call it us_support_tables since I think your current set is optimized for that.
BTW I just compiled under mingw64 but am installing the extension on my pg92 EDB build. It doesn't seem to be crashing. I haven't tried under mingw-compiled pg though, so that might be a different story. I'll let you know. Were you testing against your mingw-compiled postgresql or the EDB VC++ version? For the mingw build, I think my instructions are to compile with debug flags and so forth, so it may be catching things the EDB-installed one isn't.
comment:6 by , 12 years ago
That name is fine. I imagine we will be building more specific tables as needed and I structure my Tiger data a lot differently than I think you do. I have C code that denormalizes the Tiger data into single side records so a single TLID might generate 1+ records if it has multiple sides, address ranges and city names associated to it.
I built and installed pg 9.2 using mingw64 but didn't use the debug flags. Typically stuff that works with debug and not optimized indicates one of two things:
- the optimizer is broken in the compiler
- the code being compiled is broken and overwriting something
I could be wrong, but I thought the latter is fairly common with 64 bit stuff because things (structures) need to be aligned on 8 byte boundaries and padded, whereas most 32 bit compilers don't have that requirement.
comment:7 by , 12 years ago
Component: | tiger geocoder → pagc_address_parser |
---|
comment:8 by , 12 years ago
First step committed at r11241 .
I still need to run thru my normalize regress and integrate it in the extension as well as the regular install script, and enable the geocode_setting option in the standard normalizer so it flips to the pagc one if that is chosen.
I'm still deciding about the actual pagc code -- whether it is better to just have a download link or to package it and integrate the compilation (which will be more work). I might take baby steps and just distribute windows binaries, with a download for other distros that choose to compile it separately.
comment:9 by , 12 years ago
Regina, I think that it will be much easier for the rest of the world if we fork a stable set of code for postgis. We don't need to actually move the code into the postgis tree, but I'm thinking that we should at a minimum have a stable src tree that can easily be fetched and built. Pulling the code from 3 separate directories in my branch is not intuitively obvious and I think most distributions will not bother with it.
comment:10 by , 12 years ago
Well, in that case we might as well have it in the postgis tree. I'll bring it up for discussion on postgis-devel next.
comment:11 by , 12 years ago
For the lex, rules, gaz tables, I decided to just include those as part of the tiger install so that the only extra step needed is to install the address parser extension.
committed at r11244
I also added is_custom fields to the lex and gaz tables so people can add additional entries, and an upgrade only flushes out the ones that aren't custom.
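The upgrade behaviour described in this comment could look roughly like the following. This is a sketch only: the table is a simplified, hypothetical stand-in (only the `is_custom` field comes from the comment), and both the boolean type and the default of true are assumptions.

```sql
-- Hypothetical, simplified stand-in for a lexicon table. Only the
-- is_custom flag is taken from the comment; all other names are illustrative.
CREATE TEMP TABLE demo_lex (
    id        serial PRIMARY KEY,
    word      text NOT NULL,
    stdword   text NOT NULL,
    is_custom boolean NOT NULL DEFAULT true  -- assumed: user rows default to custom
);

-- Packaged rows are explicitly marked as not custom.
INSERT INTO demo_lex (word, stdword, is_custom) VALUES
    ('ST',  'STREET', false),
    ('AVE', 'AVENUE', false);

-- A user-added row picks up the default is_custom = true.
INSERT INTO demo_lex (word, stdword) VALUES ('XING', 'CROSSING');

-- On upgrade, flush only the packaged rows before reloading them.
-- Rows flagged as custom are left untouched.
DELETE FROM demo_lex WHERE is_custom = false;
```

With this layout, users can back up their own additions by selecting only the rows where `is_custom` is true.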
comment:12 by , 12 years ago
I made some more minor tweaks. The only thing left to do is bring the pagc code into the code base.
Regarding what gets included in PostGIS tree, would it just be this?
http://pagc.svn.sourceforge.net/viewvc/pagc/branches/sew-refactor/pagclib/api/pgsql
Or do we need the whole api? I'm guessing we need the whole api and pgsql, since the pgsql seems to reference headers from the api.
At any rate, I won't do this until you have that last issue worked out with reusing the built standardizer for subsequent query calls.
comment:13 by , 12 years ago
comment:14 by , 12 years ago
No, we also need branches/sew-refactor/parseaddress/. The current process to install this is too complicated and looks like:
- build and install library in parseaddress
- build and install library in pagclib/api
- build and install plpgsql wrapper code in pagclib/api/pgsql
There is no reason that all the required pieces cannot be pulled into a single directory and all be linked into a single postgresql library. This would make the maintenance of it much cleaner from the postgis point of view. I will look into doing this shortly.
comment:15 by , 12 years ago
I also think it makes sense to pack the standardizer into its own extension, separated from postgis.
comment:16 by , 12 years ago
I'm working on merging the three directories and builds into a single PGXS package. My plan is to do this in two steps:
- repackage the existing files into a PGXS package
- deal with the performance and memory issue we have discussed in email
comment:17 by , 12 years ago
Regina,
I have created a first pass of moving all the address_standardizer code into a single directory using PGXS to build it.
http://pagc.svn.sourceforge.net/viewvc/pagc/branches/sew-refactor/postgresql/
I have not installed or debugged it yet, but I wanted to get your feedback on it. One thing that still needs to be done is updating the README.address_standardizer, since you made some changes to the lex, gaz, and rules so that users could locally modify them and so they would load as part of the extension.
I'm thinking we should include prebuilt headers to avoid the perl requirement, unless you want to make changes to the package proper.
comment:18 by , 12 years ago
I like the idea of pre-built headers. I'm not sure if I was able to build those, since I think you gave them to me.
Actually, the lex, gaz, rules I included as part of the tiger geocoder extension (and called pagc_lex, pagc_gaz, pagc_rules). I think these changed ones would be best not packaged with the address_standardizer extension (we can stick with what you have, but add the custom field as I have in mine so people can modify for their needs). For tiger geocoder, I made changes specific to tiger geocoder, such as abbreviations for street and state instead of full names, so it conforms more to the norm_addy type the tiger geocoder uses.
I'll take a look at what you have in the next couple of days.
comment:19 by , 12 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
needs some fine tuning but essentially done
comment:20 by , 12 years ago
Keywords: | history added |
---|
pcre.dll is only used in address_parser which takes an address in a single string and tries to break it into components. This is pretty handy and I think most people will want to use it.
I don't have a strong feeling about static linking or not on Windows.
For the rules.sql, lexicon.sql, gazeteer.sql as part of the extension, I think I would make these a separate extension if possible, maybe like:
While the current geocoder is focused on Tiger, using this extension makes it trivial to support other data, like Canadian Census data or commercial data like Navteq or TeleAtlas, just by loading a different set of lexicon, gazeteer and rules. I have already done this by splitting the vendor data into one table and then standardizing into a second table that is linked by the vendor data table gid. This essentially allows you to decouple the vendor data from the standardizer and geocoder search and result scoring. My code is not totally there yet, but it is moving in that direction.
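The two-table layout described here might be sketched as follows. All table and column names are hypothetical; only the idea of a standardized table linked to the vendor table by gid comes from the comment, and the rows are hand-written rather than produced by the standardizer.

```sql
-- Hypothetical vendor table: raw data kept exactly as delivered.
CREATE TEMP TABLE vendor_streets (
    gid      serial PRIMARY KEY,
    raw_name text,
    raw_city text
);

-- Hypothetical standardized table: one row per vendor row, linked by gid.
CREATE TEMP TABLE vendor_streets_std (
    gid     integer PRIMARY KEY REFERENCES vendor_streets (gid),
    name    text,
    suftype text,
    city    text
);

INSERT INTO vendor_streets (raw_name, raw_city)
VALUES ('Main St.', 'Boston');

-- Standardized values are written by hand here for illustration only.
INSERT INTO vendor_streets_std (gid, name, suftype, city)
SELECT gid, 'MAIN', 'STREET', 'BOSTON'
FROM vendor_streets
WHERE raw_name = 'Main St.';

-- The geocoder search runs against the standardized table and joins
-- back to the vendor data only to return the original record.
SELECT v.gid, v.raw_name, s.name, s.suftype, s.city
FROM vendor_streets_std AS s
JOIN vendor_streets AS v ON v.gid = s.gid
WHERE s.name = 'MAIN' AND s.city = 'BOSTON';
```

Because the search only touches the standardized table, swapping in a different vendor data set means reloading the two tables with a different lexicon, gazeteer and rules, without changing the search query.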
I can see people wanting to add their own changes to these tables, but I can also see the need to wholesale replace them for different data as I'm doing. That said, I can work around whatever you think is best for the tiger geocoder.