Hack 45. Extract a Spatial Model from Wikipedia

The Wikipedia encyclopedia covers the world; mine its data structure for use in mapping applications.

Wikipedia is a fantastic publicly editable online encyclopedia of the world. Maintained by volunteers, many of whom are experts in their fields, it aims for a "neutral point of view" in recounting world knowledge, history, places, and political systems. This hack discusses the English Wikipedia (http://en.wikipedia.org/), but the principle will work with any of its many language editions, which feature mostly original articles.

"Wiki nature" allows anyone on the Web to edit each page; it also allows anyone to create links between pages using a simple markup, which encloses the page name in square brackets [[Like This]]. In Wikipedia, there's a special URL syntax to get a list of which Wikipedia sites link to each Wikipedia page. For example, http://en.wikipedia.org/wiki/Argentina is the country page for Argentina, and http://en.wikipedia.org/w/wiki.phtml?title=Special:Whatlinkshere&target=Argentina is the page showing all the backlinks to every page that refers to Argentina, shown in Figure 4-17.

Figure 4-17. Pages linking to a country in Wikipedia


4.12.1. Modeling Wikipedia

Implicit in the structure of Wikipedia is a kind of spatial index to events, people, and ideas. Some of these links point to places we can identify and geocode in their own right: countries, cities, towns, and regions.

For each country page in Wikipedia, we can build a set of related pages through backlinks. Some of them are lists in which every country appears; others are "History of..." and "Politics of..." pages, pages about towns and cities, and pages about important dates and people. Each of these pages links to one or many countries and cities in turn.

Wikipedia contains a lot of spatial data, including the reference data from the CIA World Factbook for each country. Many country pages have beautifully drawn flat maps. Wikipedia is rich with information about the government and administrative structures of many countries, but it doesn't have structured, machine-readable metadata, so we'll have to do some good guesswork to geolocate pages.

4.12.1.1 Countries

First we need a list of things that Wikipedia identifies as countries. There is a Wikipedia page on countries with ISO codes, which is a great place to start. With a quick regular expression, we can extract the list:

#!/usr/bin/perl
use strict;
use LWP::UserAgent;

# download the page from wikipedia
my $ua = new LWP::UserAgent;
$ua->agent("WikiMap/1.0 " . $ua->agent);
my $page = $ua->get('http://en.wikipedia.org/wiki/ISO_3166-1_alpha-2')->content;

my @lines = split("\n", $page);
my %countries;
foreach (@lines) {
    # match a 2-uppercase-letter code, and a title
    my ($code, $name) = $_ =~ /([A-Z]{2}).+title="([^"]+)/;
    $countries{$code} = $name if $name;
}

This gives us a list of everything Wikipedia thinks is a country, with its two-letter ISO code. We can use this list to collect the country page backlinks from Wikipedia. To be reasonably polite to the site, we sleep for a few seconds between each page request.
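The throttling itself is just a pause between fetches. A minimal sketch of the idea, with the delay length being our own choice:

# Inside whatever loop fetches the backlinks pages, wait a few seconds
# between requests so we don't hammer Wikipedia's servers.
foreach my $name (values %countries) {
    # ... fetch and process the backlinks page for $name here ...
    sleep 3 + int rand 3;   # pause 3-5 seconds between requests
}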

This script uses a simple Perl RDF store, Class::RDF, to store the model of Wikipedia. The model is a graph: each node is a page, and arcs represent the connections between pages. Web links don't differentiate between different kinds of connections, so we'll just define one relationship, "connects," that describes how one page links to another:

foreach my $c (keys %countries) {
    my $code = $iso_base . uc($c);
    my $url  = $wiki_base . 'wiki/' . $countries{$c};
    my $country = Class::RDF::Object->new(
        rdf->type     => wm->Country,
        iso->code     => $code,
        iso->name     => $countries{$c},
        wm->wiki_page => $url );
    my $links = $wiki_base
        . 'w/wiki.phtml?title=Special:Whatlinkshere&target='
        . $countries{$c};
    my $doc = $ua->get($links)->content;
    my @lines = split("\n", $doc);
    foreach (@lines) {
        # extract the path and title of each linked page
        my ($link, $name) = $_ =~ /<a href="\/([^"]+)"[^>]*>([\w| ]+)<\/a>/;
        if ($link and $name) {
            $link = $wiki_base . $link;
            my ($object) = Class::RDF->search( wm->wiki_page => $link );
            $object = Class::RDF::Object->create(
                wm->wiki_page => $link,
                wm->name      => $name ) if not $object;
            $object->wm::connects($country);
        }
    }
}
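This loop relies on a few things set up earlier in the script that aren't shown here: the $ua user agent, the $wiki_base and $iso_base URL prefixes, and the rdf, iso, and wm namespace shortcuts. As a rough sketch of what that setup might look like, with the database DSN and namespace prefixes as placeholders and the exact Class::RDF calls being our assumption about its typical usage:

use Class::RDF;
use LWP::UserAgent;

# Assumption: point Class::RDF at a local database and export namespace
# shortcut functions; custom prefixes like wm and iso need to be known to
# Class::RDF::NS. Adjust the DSN and URIs to your own setup.
Class::RDF->set_db("dbi:SQLite:dbname=wikimap.db", "", "");
Class::RDF::NS->export(qw( rdf iso wm ));

my $wiki_base = 'http://en.wikipedia.org/';
my $iso_base  = 'urn:iso:3166:';            # hypothetical prefix for ISO codes

my $ua = LWP::UserAgent->new;
$ua->agent("WikiMap/1.0 " . $ua->agent);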

<a href="">Watching this script run is like watching a potted conceptual history of the world. Now, we've built a graph of all the countries in the world, according to Wikipedia, with links to famous people, places, and events. But we don't know which are which, nor can we distinguish casual from important mentions.</a>

<a href="">4.12.1.2 Cities and other spatial things</a>

<a href="">We can deepen this spatial index by using a gazetteer service. [Hack #84] was written with this purpose in mind. We go through the list of each country's backlinks and, for page names that look likely to be places, try to find them in the gazetteer. We use a simple set of rules of thumb, partially borrowed from Maciej Ceglowski (http://www.idlewords.com), to identify things worth trying to geocode:</a>

<a href="">To request the information about the city from the gazetteer, we issue this GET request (you can try this out in a web browser):</a>

<a href="">http://mappinghacks.com/cgi-bin/gazetteer.cgi?query=Placename&country=ISO& format=rdf</a>

<a href="">We might get multiple results back for the same place name, in which case we'll lazily attach them all to the Wikipedia page. They may be different types of features: an administrative area, a populated place, and also a natural or landmark feature sharing its name.</a>

<a href="">4.12.2. Graphing Wikipedia</a>

<a href="">To graph Wikipedia, we'll draw something that we could call a cartogram. A cartogram represents spatial relations, but without conforming to a particular geometry or projectiona more abstract way of drawing pictures of geospatial data than cartographic maps. As our model of Wikipedia is a graph of connections, it's ideal for GraphViz, the excellent free graph layout program. GraphViz is available at http://www.graphviz.org/, and RPMs are available for Linux distributions (you can get it in Debian by running apt-get install graphviz as root). The GraphViz Perl module allows us to easily get at it programmatically.</a>

<a href="">Figure 4-18 shows the graph of a small section of Wikipedia's spatial relations, centered on the UK. The graph of all spatial interconnections in Wikipedia is too big for GraphViz to easily handle!</a>

<a href="">Figure 4-18. Graphing some of the spatial relations in Wikipedia</a>

<a href=""></a>


<a href="">4.12.3. Hacking the Hack</a>

<a href="">You can use these techniques with any corpus of text that you have on a disk somewhere, over which it might be useful to have a spatial index. A collection of news articles or RSS feeds that you want to display on a map, or old email archives to map project activity.</a>

<a href="">We chose to ignore pages whose titles were numbers. Collecting all pages whose titles are four digits and/or that match the expression /As of ^d{4}$/, you could build a pretty nice temporal index of Wikipedia. Imagine a map where you adjust a knob and see local content and political boundaries change through time. "As of" is a recent naming convention for dating pages, sometimes used in different ways, and the temporal index probably won't reveal much depth yet, but Wikipedia is always growing and refining itself.</a>

<a href=""> </a>

<a href="">Hack 46 Map Global Weather Conditions</a>
