Hack 81. Clean Up U.S. Addresses
As all election officials know, identifying the same address when it is formatted differently can be a tricky problem.
Many of the hacks in this book are meant to help us work with large volumes of geographical information. Most of the data sets described in this book are systematically gathered and organized by mapping, surveying, and databasing professionals. This results in well-defined and well-formatted data, ideal for programmatic processing (and hacking). However, other interesting data sets are populated by humans running around in the world at large who love to make typos, misspell words, enter data in the wrong fields, and mangle information in every (un)imaginable way.
Fundrace [Hack #16] was powered by just such a messy database, one that is conveniently distributed by the United States Federal Election Commission (FEC). The FEC provides software for campaigns and political committees to file their contribution records electronically, but this software is fed by people contributing over the Web or by staffers entering data from contributions collected at fundraising events or via paper mail. These filings are archived and posted by the FEC on the Web at http://herndon2.sdrdc.com/dcdev/. They contain the amount and date of each contribution, as well as the name, street address, city, state, ZIP Code, occupation, and employer of the contributor.
The data was not entirely useless in its original format. U.S. ZIP Codes, for example, are relatively clean because they are simply five numeric digits that are hard to screw up when typed into a computer and easy to validate as input. They allowed for cross-referencing with U.S. census data and for making national maps aggregated by county, ZIP Code, and state. The resulting maps were interesting, but hardly the kind of hack that was going to attract millions of visitors to the site.
Only after the contribution records were geocoded, allowing for street-by-street "money maps" and geocentric "neighbor search," did Fundrace shine in its full glory. However, some fraction of the addresses in the database did not pass muster. There was little to be done about individual or unique mistakes, such as blatant misspellings of actual street names, but the more predictable mistakes called for a systematic way to clean them up.
Generally speaking, a good geohack combines a number of different types of information, all with some kind of geospatial significance, in a new, fun, interesting way. But what to do when one of those data sources is too much of a mess to be cross-referenced against the others with an acceptable success rate, as was the case with the FEC contribution records? Go nuts with regular expressions, of course! They are ideal for such clean-up operations as normalizing street types, prefixes, and suffixes, dropping extraneous intra-building indicators, and excising the general detritus of human input.
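As a minimal sketch of those three kinds of clean-up, consider the following. The `tidy` helper and the sample address are made up for illustration, and the patterns are deliberately simplified:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: three simplified clean-up passes over one address.
sub tidy {
    my $addr = shift;
    # drop an intra-building indicator and everything after it
    $addr =~ s/\b(?:Apt|Suite|Ste)\b.*$//i;
    # normalize two common street types
    $addr =~ s/\bStreet\b/St/i;
    $addr =~ s/\bAvenue\b/Ave/i;
    # excise stray punctuation, then collapse and trim whitespace
    $addr =~ s/[^\w\s-]/ /g;
    $addr =~ s/\s+/ /g;
    $addr =~ s/^\s+|\s+$//g;
    return $addr;
}

print tidy('350  Fifth Avenue., Suite 3300'), "\n";   # prints "350 Fifth Ave"
```

The word boundaries (`\b`) matter: without them, a pattern meant for "Road" would also chew into "Broadway".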
7.5.1. The Code
The following Perl code was used by Fundrace to clean up each street address so that it could be successfully geocoded [Hack #79]. This allowed us to use the spatial data in thousands and thousands of additional records to make political money maps. Mostly composed of a series of regular expressions that normalize odd abbreviations, it was written with New York City street-naming idiosyncrasies in mind, but a selection from it should work anywhere in the U.S.:
#!/usr/bin/perl
use strict;

# for turning numerical spellings into digits
our $spelled_nums = {
    first => '1st', second => '2nd', third => '3rd', fourth => '4th',
    fifth => '5th', sixth => '6th', seventh => '7th', eighth => '8th',
    ninth => '9th', tenth => '10th', eleventh => '11th', twelfth => '12th',
    thirteenth => '13th', fourteenth => '14th', fifteenth => '15th',
    sixteenth => '16th', seventeenth => '17th', eighteenth => '18th',
    nineteenth => '19th',
    # misspelled variants
    eigth => '8th', nineth => '9th', thiteenth => '13th',
    one => 1, two => 2, three => 3, four => 4, five => 5,
    six => 6, seven => 7, eight => 8, nine => 9, ten => 10,
};

# for adding the correct suffixes to numerically named streets
our $num_suffixes = {
    1 => 'st', 2 => 'nd', 3 => 'rd', 4 => 'th', 5 => 'th',
    6 => 'th', 7 => 'th', 8 => 'th', 9 => 'th',
};

sub cleanAddr {
    my $addr = shift;

    # dropping off all intra-building identifiers
    $addr =~ s/#.*$//;
    $addr =~ s/,.*$//;
    # a floor number may precede its label ("4th Fl", "2 Floor")
    $addr =~ s/\b\d+[a-z]*\s*(?:Floor|Fl)(?![a-z]).*$//i;
    # each label must start a word and not run on into a longer one,
    # so that Flatbush, Amsterdam, and Philip survive
    for my $unit (qw(Apt Apartment Loft Fl Floor Rm Room PMB Bsmt
                     Basement PH Penthouse Ste Suite)) {
        $addr =~ s/\b$unit(?![a-z]).*$//i;
    }

    # Real numeric suffixes ('st', 'nd', 'rd' as in 1st, 2nd, 3rd)
    # are all more than 1 character. Thus, we assume that a number
    # followed by only one letter is an apartment indicator, if it has
    # been preceded by some text.
    $addr =~ s/(.*[a-z]+.*)\b\d+[a-z]\b/$1/i;

    # In NYC, people sometimes write E12 for E 12th street
    $addr =~ s/\b(E|W|East|West)(\d)/$1 $2/i;

    $addr =~ s/\bEast\b/E/i;
    $addr =~ s/\bWest\b/W/i;

    # There is a "West St" in NYC
    $addr =~ s/\bW (St|Street)\b/West $1/i;

    # strip letters glued to the leading house number ("123A Main" -> "123 Main")
    $addr =~ s/^(\d+)[a-z]+(\s)/$1$2/i;

    # Broadway often abbreviated as B'way
    $addr =~ s/\bB.way\b/Broadway/;

    # Normalize the most common street types, and some common misspellings.
    # The trailing \b keeps "Road" from matching inside "Broadway".
    $addr =~ s/\bAvenue\b/Ave/i;
    $addr =~ s/\bAvfenue\b/Ave/i;
    $addr =~ s/\bRoad\b/Rd/i;
    $addr =~ s/\bBoulevard\b/Blvd/i;
    $addr =~ s/\bPlaza\b/Plz/i;
    $addr =~ s/\bPlace\b/Pl/i;
    # no leading \b on the Street variants, so that "12thStreet" becomes
    # "12thSt" and is split by the suffix fix below
    $addr =~ s/Street\b/St/i;
    $addr =~ s/Stret\b/St/i;
    $addr =~ s/Treet\b/St/i;

    # Turn spelled numbers into digits
    for my $k (keys %$spelled_nums) {
        $addr =~ s/\b$k\b/$spelled_nums->{$k}/i;
    }

    # They do weird things with bipartite addresses in Queens, NY
    # (hyphenated house numbers such as 37-11), so keep hyphens
    # and blank out all other punctuation
    $addr =~ s/[^\s\w-]/ /g;

    # fixing numerical suffixes (e.g. "22th" -> "22nd")
    for my $num (sort keys %$num_suffixes) {
        my $suf = $num_suffixes->{$num};
        my ($digits, $currNum, $currSuff) =
            $addr =~ /[^\d](\d*)($num)([a-z]+)/i;
        next unless defined($currSuff) && lc($currSuff) ne $suf;

        # if it's a teen, then suffix is always 'th'
        $suf = 'th' if $digits =~ /1$/;
        # "12thSt" is really "12th St"
        $suf .= ' St' if $currSuff =~ /\w+st$/i;

        $addr =~ s/([^\d])$digits(${num}[a-z]+)\b/$1$digits$num$suf/i;
        last;
    }

    # no extra whitespace
    $addr =~ s/\s+/ /g;
    return $addr;
}

# run last, so that the lookup tables above are already filled in
my $addr = shift;
print cleanAddr($addr), "\n";
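The suffix-fixing step above rests on one rule of English ordinals: the last digit picks the suffix, except that 11, 12, and 13 always take "th". As a rough sketch, that rule can be written as a standalone helper. The name `ordinal` is hypothetical and not part of the Fundrace code, which patches suffixes in place with regular expressions instead:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a number with its English ordinal suffix.
sub ordinal {
    my $n = shift;
    my %special = (1 => 'st', 2 => 'nd', 3 => 'rd');
    my $suffix  = 'th';
    # 11, 12, 13 (and 111, 112, ...) keep 'th'; otherwise the last digit decides
    my $last_two = $n % 100;
    if ($last_two < 11 || $last_two > 13) {
        $suffix = $special{ $n % 10 } || 'th';
    }
    return $n . $suffix;
}

print join(' ', map { ordinal($_) } 1, 2, 3, 4, 11, 12, 13, 21, 22, 112), "\n";
# prints "1st 2nd 3rd 4th 11th 12th 13th 21st 22nd 112th"
```

A helper like this is handy when a street number arrives with no suffix at all (as in "E 12"), a case the regular-expression approach leaves untouched.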
Michael Frumin