Women Tech Women Yes

WTWY hosts an annual gala at the beginning of the summer. Beforehand, they send out street teams to collect email signups at subway entrances. Ideally, the email signups will convert to gala attendees and contributions. To optimize the effectiveness of these street teams, this noble and fictional nonprofit has asked two of the world's most brilliant data scientists, Sam Funk and myself, for help. As our first project in the Metis Data Science Bootcamp, we combined New York City subway data and census data to recommend where they send their teams.

Accessing the data

Given WTWY's goal of maximizing gala attendees and donations, we targeted two demographics: early-career women interested in STEM and wealthy individuals. Census data contained statistics that served as estimates for each of these demographics.

Map created by Frank Donnelly

We learned of the Census data broken down by zip code thanks to Frank Donnelly, a graduate student at Baruch College. Frank has built a wonderful interactive map that helped us explore what was available and what would be significant. However, to actually access the data, we went straight to the source, the American Community Survey 5-Year Estimate. These data were very clean and consistent. Thank you, Uncle Sam.

The MTA publishes turnstile data on a weekly basis. These data were not so clean and consistent; wrangling them was the bulk of this project. More on that below.


    The census data is organized by zip code; the subway data is organized by subway station. To combine these data we needed to make two large assumptions. First, we assumed that every person riding the subway works an office job with regular office hours. Second, we assumed that all riders have simple commutes. In the morning, they board a train near where they live and get off near where they work. In the evening, they do the opposite.

    These are obviously broad assumptions. They are not realistic for any city, let alone the city that never sleeps. But these assumptions served our purposes well. First, we are most interested in commuters who work regular office jobs. Second, we are interested in career-focused tech workers, who are likely to earn more and to pay more to shorten their commutes.

    These assumptions allow us to combine the data. The census data is based on where people live. If everyone commuting lives near their station, we can assume that morning rush hour entries and evening rush hour exits correspond to folks who live nearby.

      Census Data

      Census data for each zip code allowed us to pull three statistics that served as estimates of these demographics:

      1. Percent of population that are women aged 20 to 34.
      2. Percent of bachelor's degree holders who are women with degrees in "science, engineering, or a related field."
      3. Median income.

      We combined the first two statistics to estimate the percent of early-career STEM women. This metric is obviously imprecise, but that's OK. We're not interested in absolute values for each zip code, only in how each zip code ranks relative to the others. Ranking median income was more straightforward, but the data top out at a reported income of "$250,000 or more." This means our income analysis probably has an artificially low ceiling. I've visited New York City twice, and it seems that anyone would have to be a millionaire in order to live there. But those millionaires probably aren't taking the subway anyway.
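The post does not say exactly how the two statistics were combined. As a minimal sketch, assuming the two shares are simply multiplied and that only the resulting ranking matters, the estimate could look like this. The zip code labels, field names, and percentages are hypothetical.

```python
# Hypothetical zip codes and shares, for illustration only.
zip_stats = {
    "zip_a": {"pct_women_20_34": 0.15, "pct_stem_women_grads": 0.20},
    "zip_b": {"pct_women_20_34": 0.10, "pct_stem_women_grads": 0.35},
    "zip_c": {"pct_women_20_34": 0.12, "pct_stem_women_grads": 0.10},
}


def stem_women_estimate(stats):
    # ASSUMPTION: the two shares are combined by multiplying them.
    # The original analysis may have combined them differently.
    return stats["pct_women_20_34"] * stats["pct_stem_women_grads"]


def rank_zip_codes(zip_stats):
    """Return zip codes ordered from highest to lowest estimate."""
    return sorted(
        zip_stats,
        key=lambda z: stem_women_estimate(zip_stats[z]),
        reverse=True,
    )


ranking = rank_zip_codes(zip_stats)
print(ranking)
```

Because only the ordering is used, any rescaling of the estimate that preserves order would give the same ranking.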

      Wrangling the MTA Turnstile Data

      Now for the fun part. The MTA publishes cumulative entry and exit counts for every station, along with the timestamp of each reading. Below is a snippet of entries for the 103rd St. station on the BC line.

      Two days of records for the 103rd St. station on the BC line.

      This may seem easy enough: simply subtract each row from the next to get new entries and exits. However, these records are per turnstile, not per station, so we must first group the data by turnstile.

      The tags C/A, UNIT, and SCP are somewhat vaguely defined by the MTA, but it was clear that grouping by some or all of these tags would bin the data by turnstile. Visualizing the records helped us discern where the records switched turnstiles.

      Exploring binning by SCP.

      Above, we plot, in chronological order, all the records for the 103rd St. station for the first week in our dataset. There are clear breaks that imply different turnstiles. These breaks occur each time the records log a change in the SCP. (Grouping by other tags helps narrow in on a specific station, but SCP appears to be the most granular level.)
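The grouping and differencing step can be sketched in plain Python. This is an illustration rather than our actual code: it assumes each turnstile is identified by the combination of C/A, UNIT, and SCP, and the sample rows (IDs, dates, and counts) are made up.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (C/A, UNIT, SCP, timestamp, cumulative entries, cumulative exits)
rows = [
    ("A001", "R001", "00-00-00", "2016-03-26 00:00:00", 1000, 500),
    ("A001", "R001", "00-00-00", "2016-03-26 04:00:00", 1030, 540),
    ("A001", "R001", "00-00-01", "2016-03-26 00:00:00", 90000, 70000),
    ("A001", "R001", "00-00-01", "2016-03-26 04:00:00", 90012, 70025),
    ("A001", "R001", "00-00-00", "2016-03-26 08:00:00", 1100, 560),
]


def per_turnstile_deltas(rows):
    """Group readings by turnstile, then subtract consecutive readings."""
    by_turnstile = defaultdict(list)
    for ca, unit, scp, ts, entries, exits in rows:
        when = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        # ASSUMPTION: (C/A, UNIT, SCP) together identify one turnstile.
        by_turnstile[(ca, unit, scp)].append((when, entries, exits))

    deltas = []
    for key, readings in by_turnstile.items():
        readings.sort()  # chronological order within each turnstile
        for (t0, e0, x0), (t1, e1, x1) in zip(readings, readings[1:]):
            deltas.append({
                "turnstile": key,
                "start": t0,
                "end": t1,
                "new_entries": e1 - e0,
                "new_exits": x1 - x0,
            })
    return deltas


deltas = per_turnstile_deltas(rows)
for d in deltas:
    print(d["turnstile"][2], d["end"], d["new_entries"], d["new_exits"])
```

Differencing only within a group avoids subtracting one turnstile's counter from another's, which is what produces the breaks seen in the plot.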

      Turnstile Resets

      The turnstiles cannot count up forever. Eventually, they roll over and start counting up from zero again. When this occurs, we calculate an inaccurately high value for new entries or exits. Sometimes this rollover value is obviously unrealistic, such as 6 million people in a four-hour period, but other times it masquerades as plausible data. Luckily, these rollovers are rare. We discard any record that claims more than 3,600 new commuters, which works out to more than one every four seconds across a four-hour period.

      Removing large spikes in the data also helps us focus on only habitual commuters. For example, if Kevin Hart performs at Madison Square Garden, we might see 20,000 fans exiting at just a few stations. But these fans aren't regular commuters, and mapping them to the census data at MSG's zip code is silly.
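The 3,600 cutoff follows directly from the arithmetic: four hours is 14,400 seconds, and one commuter every four seconds gives 3,600. A minimal sketch of the filter is below. The field names and sample values are hypothetical, and rejecting negative counts is an added assumption not described in the text.

```python
FOUR_HOURS_IN_SECONDS = 4 * 60 * 60  # 14,400 seconds
SECONDS_PER_COMMUTER = 4
MAX_PER_READING = FOUR_HOURS_IN_SECONDS // SECONDS_PER_COMMUTER  # 3,600


def is_plausible(count, limit=MAX_PER_READING):
    # ASSUMPTION: negative counts are rejected too; the text only
    # describes discarding values above the limit.
    return 0 <= count <= limit


def drop_implausible(deltas):
    """Keep only records whose new entries and new exits are both plausible."""
    return [
        d for d in deltas
        if is_plausible(d["new_entries"]) and is_plausible(d["new_exits"])
    ]


# Hypothetical per-reading counts.
sample = [
    {"new_entries": 120, "new_exits": 95},
    {"new_entries": 6_000_000, "new_exits": 40},  # rollover artifact
    {"new_entries": 3600, "new_exits": 10},       # exactly at the limit: kept
    {"new_entries": -500, "new_exits": 10},       # counter ran backwards
]
kept = drop_implausible(sample)
print(kept)
```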

      Time Between Readings

      Four hours was the most common interval between records. We are interested in rush hour traffic, and records spanning longer than four hours are too broad to isolate it, so we tossed these data as well.

      Records of vast swaths of time are too broad for our purposes.
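A small sketch of this interval filter, assuming each record carries the timestamps of its two readings. The dates are hypothetical.

```python
from datetime import datetime, timedelta

MAX_INTERVAL = timedelta(hours=4)


def within_max_interval(start, end, max_interval=MAX_INTERVAL):
    """True when the two readings are at most four hours apart."""
    return timedelta(0) < end - start <= max_interval


# Hypothetical (start, end) pairs of consecutive readings.
readings = [
    (datetime(2016, 3, 26, 4), datetime(2016, 3, 26, 8)),    # 4 hours: kept
    (datetime(2016, 3, 26, 8), datetime(2016, 3, 26, 20)),   # 12 hours: dropped
    (datetime(2016, 3, 26, 20), datetime(2016, 3, 26, 23)),  # 3 hours: kept
]
kept = [pair for pair in readings if within_max_interval(*pair)]
print(len(kept))
```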

      With the data thoroughly cleaned, calculating morning and evening rush hour was as easy as pie.

      Scoring Each Station

      With the datasets combined, each subway station could be ranked by average traffic, percent of the population within our target demographic, and median income. We wanted to formulate a score for each station that is a combination of these features.

      To form this score, we calculated a percentile rank for each station's traffic and for the demographics of the zip code it sits in. Our score is then a weighted sum of these percentile ranks.

      For our example recommendations, we weighted STEM women at 50%, station foot traffic at 30%, and median income at 20%. Below are the top 15 stations to visit during both morning and evening rush hour. 

      Evening Stations

      0: TIMES SQ-42 ST 1237ACENQRS
      1: WALL ST 23
      2: CHAMBERS ST ACE23
      3: WALL ST 45
      4: CHAMBERS ST 123
      6: 5 AVE 7BDFM
      9: FULTON ST 2345ACJZ
      10: RECTOR ST 1
      11: CHAMBERS ST JZ456
      13: 23 ST FM
      14: 23 ST NRW

      Morning Stations

      0: TIMES SQ-42 ST 1237ACENQRS
      1: 34 ST-PENN STA 123
      2: 34 ST-PENN STA ACE
      4: FULTON ST 2345ACJZ
      5: CHAMBERS ST ACE23
      8: WALL ST 23
      9: 23 ST FM
      10: CHAMBERS ST 123
      11: 42 ST-PORT AU ACE
      12: RECTOR ST 1
      14: 30 AV NQW


      Our example weights are by no means fixed. Our future deliverable will include a dashboard that allows the client to toggle the scoring weights. We don't really think visiting Times Square would be productive; it's far too chaotic. The client could easily adjust the weights to limit the effect of high-foot-traffic stations.
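The weighted score described above can be sketched in a few lines. This is an illustration, not our production code: the station names and feature values are hypothetical, and the percentile rank here is assumed to be the fraction of stations with a value less than or equal to the station's own.

```python
def percentile_ranks(values):
    """Fraction of values less than or equal to each value (between 0 and 1)."""
    n = len(values)
    return [sum(other <= v for other in values) / n for v in values]


def score_stations(stations, weights):
    """Weighted sum of per-feature percentile ranks, highest score first."""
    ranks = {
        feature: percentile_ranks([s[feature] for s in stations])
        for feature in weights
    }
    scored = [
        (s["name"], sum(w * ranks[f][i] for f, w in weights.items()))
        for i, s in enumerate(stations)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stations and feature values.
stations = [
    {"name": "Station A", "stem_women": 0.030, "traffic": 9000, "income": 80000},
    {"name": "Station B", "stem_women": 0.010, "traffic": 20000, "income": 60000},
    {"name": "Station C", "stem_women": 0.020, "traffic": 4000, "income": 120000},
]

# The example weights from the text: 50% STEM women, 30% traffic, 20% income.
weights = {"stem_women": 0.5, "traffic": 0.3, "income": 0.2}

ranked = score_stations(stations, weights)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Passing a different `weights` dictionary, for example one with a smaller traffic weight, is all it takes to damp the pull of the busiest stations.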