Disclaimer: I’m posting this mainly as a means of remembering the described process. Scraping government data is supposed to be legal in most cases.
The NYC Data Mine is a government project that exposes popular datasets about many governmental agencies/departments. I’m a huge supporter of the transparency of governmental agencies in this regard — especially in allowing developers to create some cool and useful products with their datasets. The datasets come in two flavors: RAW (txt, csv, etc) and Geographical (SHP, and others).
Scraping to SHP
Unfortunately, the site doesn’t happen to have the dataset that I need for a new, top-secret indie project. My first attempt at grabbing the data involved using html scraping techniques to parse the html pages from their searchable site. That turned out to be a headache and pretty crude, so I was back to the drawing board. The thing is, the datamine actually did have the data I needed, but it’s in a geographical format known as SHP (shapefile). Shapefiles are graphically render-able files that are used for more geographical information system (GIS) tasks. Since the RAW data wasn’t available, and I won’t spend the 10 months waiting for the agency to respond to my data request, I had to transform the graphical format into a usable format.
SHP to KLM
I couldn’t find any readily-available SHP to CSV (one of the more popular raw data file formats) converters, but I did find two cool programs called Shape2Earth and MapWindow. Basically, Shape2Earth is a plugin for the MapWindows application.
MapWindow reads SHP files and (using the Shape2Earth plugin) converts the SHP data into a KLM (Keyhole Markup Language) format (mainly used for visualizations in Google Earth). Try opening the resulting KLM file into Google Earth for some an awesome visualization! Of course, KLM isn’t the ideal format for any RAW data manipulation.
KLM to CSV
I found another useful program called KMLCSV Converter freely available. This program does as the name indicates: converts KLM files to CSV files. The conversion was smooth and painless. The resulting CSV file consisted of two columns containing geolocation data — namely longitude and latitude coordinates for each top-secret item.
CSV to TXT
The next task was to get this geolocation into a MySQL database for use in a web or mobile app. I decided to convert the CSV to TXT to use PHP to easily parse the textfile of geolocation data. Using a little PHP magic, I now have a database full of organized geolocation data for the top-secret items.
The Lesson Learned
I’ll be pissed if the government agency emails me tomorrow with the dataset! In any case, scraping is really tedious and error-prone. However, without finding the SHP files, I probably would have to scrape about 400 individual webpages for the necessary information.
Thanks to some free software, the conversion process wasn’t bad. Of course, a more programmatic way of handling the conversions would automatize the whole system so that no manual intervention would be necessary. Further work will need to be done to speed up the process; thank goodness the government agency only updates the data every quarter!