May 3, 2020

Marathon mapping with D3, GeoPy, and a little Node.js

This post outlines the steps I took to make a set of visualizations of US marathons by month of race, date of inception, and temperature. I was motivated to make this visualization based on both a personal curiosity about marathon data and a desire to explore some new tools — particularly map projections in D3.js. This post goes through how I gathered and merged the data from its several sources and then made the visualization, touching on the following topics:

Many of these procedures were new to me. As such, this post is presented in the spirit of learning and sharing progress.

I got my initial list of marathons from Wikipedia's list of North American marathons using Python. Helpfully, the pandas library can create DataFrames from html tables on a webpage given its url (you'll of course want to be courteous in using this capability). Here, I get the data data into a DataFrame and perform a few initial manipulations:

import pandas as pd 

# Get the first of list of tables at the given URL containing "Boston Marathon", as a DataFrame 
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_marathon_races_in_North_America", match="Boston Marathon")[0]

# Shorten the name of the month column 
df.rename(columns={'Month**Last race was held': 'Month'}, inplace=True)

# Drop Ref and Link columns because I won't need them
df.drop(['Ref', 'Link'], axis=1, inplace=True)

# Filter to US marathons
df = df[df['Country'].str.contains('United States')]

# Create latitude and longitude columns for later use
df['latitude'] = pd.Series() 
df['longitude'] = pd.Series()

            

This list is not a complete catalog of US marathons, but it will be good enough for my purposes with this project.

The DataFrame of marathons that we now have has a location for each race in the form of the text in the City column. That data is helpful to have, but in order to plot each marathon on a map, we need the coordinates for in the form of latitude and longitude for each race. The Python package GeoPy provides tools to easily convert from city name (or other specifications of location, such as an address) to latitude and longitude, using your choice of several geocoding web services. I chose to use GeoNames geocoder, and made a login for this service.

Geopy is installable via pip. Here, I use GeoPy to coordinate with GeoNames, which does the heavy lifting in finding a latitude and longitude for each marathon city, and save the output as a csv.

from geopy import geocoders

gn = geocoders.GeoNames(username=USERNAME_GOES_HERE) 
for idx, city in df['City'].iteritems():
    if gn.geocode(city) is not None:
        lat_lon = gn.geocode(city)[1]
        df.loc[idx, ['latitude', 'longitude']] = lat_lon
pd.to_csv('us_marathons.csv')

A few marathon city entries were not successfully geocoded because of their format — for example, their city name was blank, or was something like "Hopkinton to Boston, Massachusetts" — so I looked up a few latitudes and longitudes manually.

Next, I obtained historical average daily high and low temperatures for the location of each marathon during the month in which the marathon is held.

I wanted to use this project to work more with Node.js, so I wrote JavaScript code to obtain the temperature data from World Weather Online's API. An alternative option, closer to my coding comfort zone, would have been to use Python's requests module to perform the HTTP requests for temperature instead.

I used the JavaScript request library to make my requests to World Weather Online's API and the fast-xml-parser library to parse the responses. Because I had saved my marathon data as a csv, I also used the libraries fs, csv-parse, and csv-writer to read in the csv and then write an updated csv.

I had to play around a little with promises and asynchronous programming here, and I know a pro could do this task more elegantly. The code shows one way to make sure we will first load the data from the csv, then make a request for temperature data for a location then when the result is returned, include that result in the data, then when the temperatures for each location have been obtained, save the result as a csv.


// Import libraries
const fs = require('fs');
const request = require('request');
const xmlParser = require('fast-xml-parser');
const csvParser = require('csv-parse');
const createCsvWriter = require('csv-writer').createObjectCsvWriter;

// Initialize csvWriter
const writeFileName = "us_marathons_with_temp.csv"
const csvWriter = createCsvWriter({
    path: writeFileName,
    header: [
        {id: 'Name', title: 'Name'},
        {id: 'City', title: 'City'},
        {id: 'Country', title: 'Country'},
        {id: 'Month', title: 'Month'},
        {id: 'Inception', title: 'Inception'},
        {id: 'latitude', title: 'latitude'},
        {id: 'longitude', title: 'longitude'},
        {id: 'avgMinTemp_F', title: 'avgMinTemp_F'},
        {id: 'avgMaxTemp_F', title: 'avgMaxTemp_F'},
    ]
});

// List of month names
const monthList=['January','February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'];

// Parser will parse the csv, iterate through each line, and request the monthly average for each location
var parser = csvParser({delimiter: ',',columns: true}, function(err, data){
    const forLoop = async _ => {
        for(var i=0; i < data.length; i++){
            // Extract month index (Jan=1,Feb=2,etc.) for this row's race
            var raceMonth = data[i].Month;
            var raceMonthNum = monthList.indexOf(raceMonth);

            // Extract latitude and longitude and use it to create query 
            lat = data[i].latitude;
            lon = data[i].longitude;
            input.query = String(lat) + ',' + String(lon);

            var weatherUrl = _ApiBaseURL + 'weather.ashx?q=' + input.query + '&format=' + input.format + '&num_of_days=' + input.num_of_days + '&date=' + input.date + '&key=' + input.key;
            await getData(weatherUrl,raceMonthNum).then(async function(monthData){
                data[i].avgMinTemp_F = monthData.avgMinTemp_F;
                data[i].avgMaxTemp_F = monthData.absMaxTemp_F;
            });
        }
    }
    forLoop().then(function(){
        csvWriter.writeRecords(data);
    });
    
    });

// Request temperature data and return data for race month as a Promise
function getData(url, raceMonthNum){
    return new Promise(function(resolve, reject){
        request(url, {method:'GET'}, function(err, res, body){
            if(err){
                reject(err);
            } else {
                var jsonObj = xmlParser.parse(body);
                var monthData = jsonObj.data.ClimateAverages.month[raceMonthNum];
                resolve(monthData);
            }
        });
    });
}

// Object for constructing API query
var input = {
    query: '42.3495,-71.0744',
    format: 'xml',
    num_of_days: '1',
    date: '',
    key: YOUR_API_KEY_HERE
}

// Read the csv and pipe it to the parser
fs.createReadStream('us_marathons.csv').pipe(parser);

            

To create the map, I added a div with id us-map to my html file to serve as a container for the map. I also needed to include D3 (v5) and the topojson file that I wanted, so I included the following in my html:

<script src="https://d3js.org/d3.v5.min.js"></script>
<script src="https://d3js.org/topojson.v2.min.js"></script>

In my JavaScript source code, I then created the map and the markers with the following steps.

First, I used D3 to create an svg element within the us-map div, with height and width sent based on the width of the body element.

Next, I created a US map projection with d3.geoAlbersUSA() and passed that projection to d3.geoPath() to create a path for the drawing of the map. I then loaded the appropriate topojson US map (I wanted one that showed state boundaries), and used the geoPath together with the data from the topojson file to draw the map. Adding the the land and boundary classes allowed for css styling of these paths.

Once the map was loaded, I used d3.csv() to load the marathon data, which I joined to circle elements. I set the position of those circles using their associated latitude and longitude. Here the projection does the work of mapping from latitude and longitude to the position within the svg to set the circles' cx and cy properties. The fill of the circles is colored according to the corresponding temperature data, with the help of a colorscale created with d3.scaleSequential and d3.interpolateRdYlBu.


var width = d3.min([parseInt(d3.select('body').style("width"))-50,1000])
var height = width * 5.5 /10;

// Create an svg to hold the map
var svg = d3.select("#us-map")
        .append("svg")
        .attr("width", width)
        .attr("height", height);

// Create the map projection and a path for it
var projection = d3.geoAlbersUsa()
    .translate([width/2, height/2])    
    .scale([width*1.2]);

var path = d3.geoPath()
    .projection(projection);

// Create a colorscale for the circles fill color (using a domain of 45-85 deg F)
var colorscale = d3.scaleSequential(d3.interpolateRdYlBu).domain([85,45]);


// Load the map json and then the data csv, and add circles for each data point
d3.json('states-10m.json').then(us => {
    svg.append("path")
        .datum(topojson.feature(us, us.objects.states))
        .attr("class", "land")
        .attr("d", path);
    svg.append("path")
        .datum(topojson.mesh(us, us.objects.states, (a, b) => a !== b))
        .attr("class", "boundary")
        .attr("d", path);
    
    // Once the map is loaded, load the csv and draw a circle for each data point at the correct latitude and longitude
    d3.csv('data/us_marathons_coded.csv').then(data =>{
        svg.selectAll("circle")
            .data(data)
            .enter()
            .append("circle")
            .attr("cx", function(d){return projection([Number(d.longitude), Number(d.latitude)])[0]; })
            .attr("cy", function(d){return projection([Number(d.longitude), Number(d.latitude)])[1]; })
            .attr("r", 3)
            // Style the circles to be colored according to the average daily max temperature
            .style("fill",function(d){
                return colorscale(d.avgMaxTemp_F);
             })
            .style("opacity","1.0")
            // Show tooltips on hover and click (showTempToolTip() is defined in the next section)
            .on("mouseover", function(d) { 
                if(this.style.opacity > 0.2){
                    showTempToolTip(d)
                } 
            })
            .on("click", function(d) { 
                if(this.style.opacity > 0.2){
                    showTempToolTip(d)
                }  
            }) 
    });
});

The tooltip was created with just one div element appended to the body, with its own class of tooltip added for easy selection and styling. When the user hover over or click on a circle, we want to move the div containing the tooltip to the correct (absolute) location so that it is right next to the circle, set the text based on the data corresponding to that circle, and set the tooltip's opacity high enough for it to be visible.

Here is how I add the div for the tooltip and give it a class using D3:
// Append div for tooltip to SVG
var toolTipDiv = d3.select("body")
                .append("div")   
                .attr("class", "tooltip")               
                .style("opacity", 0);
                

The css styling that I add to my external stylesheet to style the tooltip is:

.tooltip {   
    position: absolute;           
    text-align: center;           
    width: 80px;                  
    height: auto;                 
    padding: 2px;             
    font: 12px sans-serif;        
    background: white;   
    border: 0px;      
    border-radius: 8px;           
    pointer-events: none;        
}

In the previous section, when I constructed the map, my code would run the showTempToolTip() function on hover or click for each circle. Now I need to write that function to position the tooltip, create text, and set the opacity (with a smooth transition for the opacity).


function showTempTooltip(d){
    tempToolTipDiv.transition()        
        .duration(200)      
        .style("opacity", .9); 
    tempToolTipDiv.html('')
    tempToolTipDiv.append("p").text(d.Name);
    tempToolTipDiv.append("p").text("avg. high: " + d.avgMaxTemp_F)
    tempToolTipDiv.append("p").text("avg. low: " + d.avgMinTemp_F)
    tempToolTipDiv
        .style("left", (d3.event.pageX) + "px")     
        .style("top", (d3.event.pageY - 28) + "px"); 
}

Next I wanted to visualize marathons by month, so that the user could manipulate a slider to select different months of the year, and see marathons appear on the map for the corresponding month.

Creating a slider from scratch in D3 seems like a cool challenge to tackle, but for this step I was happy to use a nice options that was already are out there. I used John Walley's d3 simple slider, adapting his example posted here.

I added divs with the ids slider-months and slider-value to contain the slider and its associated text to my html and also include the following script:

<script src="https://unpkg.com/d3-simple-slider"> </script>

Then, we can add the slider itself:


const dataTime=[1,2,3,4,5,6,7,8,9,10,11,12];
const monthList=['January','February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'];
const monthListShort=['Jan','Feb','Mar','Apr', 'May', 'Jun', 'Jul', 'Aug','Sep','Oct','Nov','Dec'];

// Months slider
var sliderMonths = d3
    .sliderBottom()
    .min(1)
    .max(12)
    .step(1)
    .width(0.9*width)
    .tickFormat(function(d,i){return monthListShort[i]})
    .tickValues(dataTime)
    .default(4)
    .handle(
        d3.symbol()
            .type(d3.symbolCircle)
            .size(200)
        )
    .on('onchange', val => {
        d3.select('#value-months').text("US marathons in " + monthList[val-1]);
        svg.selectAll("circle")
            .style("opacity", function(d){ 
                if (d.Month == monthList[val-1]){return 1.0; }
                else {return 0;}
            })
        d3.select('#slider-months g.parameter-value text').text(monthListShort[val-1]);

    })
    .on("end", val=> {
        d3.select('#slider-months g.parameter-value text').text(monthListShort[val-1]);
    });

var gTimeMonths = d3
    .select('div#slider-months')
    .append('svg')
    .attr('width', width)
    .attr('height', 100)
    .append('g')
    .attr('transform', 'translate(15,15)');

gTimeMonths.call(sliderMonths);

Now we make the map react to the slider. One way to show only the marathons in a given month is to set the opacity of the circles corresponding to marathons to 1 if they are in the selected month, and 0 otherwise.

svg.selectAll("circle")
    .style("opacity", function(d){ 
        if (d.Month == monthList[sliderMonths.value()-1]){return 1.0; }
        else {return 0;}
    });

The above steps gave me the main features that I wanted my visualization to have. From there, there were a few more steps I took to get the final look and functionality that I wanted. For concision, I will only summarize these steps here. You can see their full implementation in the source code. These steps were:

  • Create a colorbar legend for the temperature
  • Break the visualization into three (by month, inception data, and temperature), with a selection tab for each. I added classes to a number of elements (such as sliders), so that I could easily select and set elements that were not corresponding to the currently selected tab to have a display attribute of "none" or a visibility attribute of "hidden"
  • Add a slider for inception year
  • Add a tooltip that would show only the race name (instead of name and temperature) when the temperature tab was not selected
  • Use the inception year to calculate the total number of races created by each year
  • Add a small line graph to the inception year tab to show total marathons over time
  • Find race day temperatures for a few races of interest (e.g. Boston)
  • Create a small line graph in the temperature tab to show race day highs and lows over time for a few races of interest

That's all - thanks for reading!