Assignment Two: Exploratory Data Analysis

Subtheme: Transit-Oriented Development

1.0 Overview

As a systems engineer, the topic of Transit-Oriented Development (TOD) stood out to me for three reasons. First, I’m fascinated by how the built environment, and particularly transit, affects how people interact with the world. Second, the flow of people from place to place reminds me of how sub-system interactions result in emergent behavior; I’m curious as to what behaviors we’ll be able to observe. And finally, trains (and buses and subway cars) are just so cool! From a data visualization perspective (and as someone who has long had a soft spot for the London Underground) I’m also reminded of how Tube maps have evolved over time (Figure 1): they started as to-scale geographic maps and gradually evolved into more stylized representations that make concessions to reality for the sake of usefulness to the commuter. There are artistic motifs and philosophical approaches here which we should adapt when analyzing TOD.

London Tube map evolution
Figure 1. Evolution of depictions of the London Tube. (Left) In 1908 the Underground Electric Railways Company of London produced a geographic map. While it makes sense on paper, the city center is cluttered while the periphery is largely blank. (Center) In 1933 draughtsman Harry Beck adopted the design principles of an electrical circuit schematic to produce a new simplified map focusing on usability. (Right) Today’s map published by Transport for London borrows heavily from Beck’s design, but with additional features such as fare zones and accessibility information. Images courtesy of Transport for London.

2.0 Overall analysis questions

I was drawn to several key questions (and follow-up questions) when conducting this analysis, inspired both by the readings as well as my own experiences as a Boston commuter. These questions were constructed prior to exporting the dataset; some I expect to be able to answer now, others will need to wait for the Final Project.

Q1.0: Do higher density areas near stations exhibit higher ridership per capita?

Q2.0: What factors influence station usage?

Q3.0: Which stations are good candidates for TOD?

Q4.0: How interconnected is Boston, and how does this change for people with limited mobility?

MIT to Wilson's Diner (26 min by car, 62 min by bus)
Figure 2. Google Maps directions between MIT and Wilson’s Diner for a well-deserved 9am breakfast after a night spent in the lab. Yes, my sleep schedule is doing great.

Q5.0: How do neighborhood indicators change after TOD adoption, and do those changes align with common community fears (e.g., traffic, density, character)?

Columbia street housing
Figure 3. High-density development in a historically relevant region. Compare 44 Columbia Street (Left) with the Market Central apartments (Right), which are separated by less than 500 feet. Both are near the Red Line and the 1 bus route.

Q6.0: How alive is Boston?

3.0 Dataset investigation

To answer these questions I’m looking at the publicly available MBTA General Transit Feed Specification (GTFS) dataset as concatenated by Riccardo Fiorista for MIT’s Interactive Visualization & Society course. I analyzed it using a mixture of Tableau and Python's matplotlib library. Along the way I also pose some preliminary questions, labelled Q0.X, and answer them in the text as A0.X.

3.1 Dataset variables and distribution

After reviewing the associated metadata and GeoJSON files, the dataset can be understood as a structured representation of the MBTA transit system organized into four primary layers: (1) transit stops, (2) transit lines connecting those stops, (3) a 50 meter buffer applied to the lines, and (4) a 0.5 mile (805 meter) buffer applied to the stops. Key variables are summarized in Table 1, combining topology, service-area geometries, and station features.

Table 1. Key variables contained within the MBTA stop and route datasets. Variables describe identification, hierarchy, service classification, accessibility, route linkage, and spatial geometry.
| Variable | Description | Example |
|---|---|---|
| stop_name | The common name for the transit stop. | Central |
| stop_id | Unique identifier for the stop. | 70069 |
| parent_station | Identifies the broader station complex. Terminals with multiple departure points (e.g., bus stops or platforms) are grouped under the same parent station. | place-cntsq |
| route_desc | The primary service category for the data. In addition to the Ferry and Rapid Transit (i.e., the 'T') labels, it also includes categories like Rail Replacement Bus. | Rapid Transit |
| zone_id | Unlike many cities (which value things like 'what if we tracked everyone everywhere on the train'), Boston doesn't typically engage in zone-based pricing. Entering a station to buy a donut and then leaving costs just as much as riding the entire Red Line. In this export, zone_id takes values like 'RapidTransit' rather than numeric fare zones, suggesting it is being used as a mode label. | RapidTransit |
| wheelchair_boarding | Indicates station accessibility status: 1 is accessible, 2 is not accessible, 0 is unknown. | 1 |
| level_id | Identifies the vertical level of the stop (above or below ground). | level_-1_platform |
| route_long_name | The full name of the route that the stop belongs to. | Red Line |
| route_color | The hexadecimal color associated with the transit route. | #DA291C |
| geometry | GeoJSON geometry describing the stop location (Point) or service buffer (Polygon), depending on dataset version. | POLYGON ((-71.09382 42.3653, -71.09401 42.36672, ...)) |

3.2 Initial data quality analysis

Because this is a geometry-heavy dataset, my first “gut check” was spatial: I plotted the full set of stops (points), routes (lines), and the provided buffer geometries (polygons) to confirm that the shapes align with real-world coordinates. Q0.1: Does our data look like the MBTA we know and (occasionally) love?

Three immediate observations come from Figure 4. First, there are no obvious coordinate errors: the data fall within the expected MBTA service region and there are no 'teleportation' artifacts (e.g., Null Island). Second (as we can see in the Salem inset), the route geometry is higher fidelity than a simple node-to-node network; lines trace plausible road-based routes, which increases confidence that this feed is suitable for spatial reasoning beyond a schematic graph. Third, the plotted extent includes Rhode Island. This is not a mistake: the MBTA’s commuter rail service actually does reach Providence via a long-standing relationship. We assume that this is both to reduce congestion and a form of soft power whereby eventually the entire New England region will be part of the reborn City State of Boston. For the purposes of this assignment, I treat these out-of-state segments as valid data (they are part of the system), but have excluded them from analysis since land use, commuting patterns, and governance context differ. A0.1: Yes! We're plotting the correct MBTA!

MBTA dataset
Figure 4. Complete MBTA dataset showing stops (points), routes (lines), and the provided buffer geometries. An inset near Salem illustrates the relative scale of the 50 m route buffer versus the 0.5 mile station buffer, and shows how station buffers heavily overlap in dense corridors.

One additional design issue becomes apparent when comparing the two provided buffer scales: the station buffer (0.5 mile / 805 meters) is visually dominant, while the route buffer (50 meters) is nearly invisible at statewide scale and only becomes legible when zoomed in. At this early stage, I do not yet have a strong analytic reason to privilege either buffer (or for that matter, any buffer at all), and the station buffers often overlap heavily in dense areas, which complicates interpretation (e.g., double-counting “served area” if polygons are naively summed). For the remainder of the initial data review, I therefore treat these buffers as descriptive context rather than as the primary unit of measurement and have refrained from analyzing them further. In later phases I plan to compute distances directly to stops along the network rather than rely on the provided buffer polygons.

The map also reveals a major structural feature of the dataset that affects visualization choices: that of extreme spatial density in the urban core (many stops with overlapping buffers) and sparse coverage at the periphery. This is not necessarily an error; it reflects the real structure of the MBTA system, where the bus and rapid-transit networks create dense local coverage, while commuter rail extends long distances with comparatively few stations. Practically, this means that any all-in-one map suffers from overplotting in the core, while leaving large blank areas elsewhere. Now, let's include latitudinal and longitudinal histograms (Figure 5) to indicate spatial distribution of the stops in a way that overcomes overplotting. For subsequent plots, I use either zoomed insets or aggregate summaries (e.g., counts by category) to avoid misinterpretation of these artifacts.

To keep later comparisons meaningful, I define an analysis window that reflects where the dataset is most relevant for “bus + rapid transit + commuter rail” interactions. I anchor this window around the datum of South Street Diner, which not only (according to their website) makes excellent pancakes but also sits directly adjacent to South Station, the primary intermodal hub of the MBTA network. I define a circular analysis region with a 25 km radius (50 km diameter) centered on this point as the initial working extent. This circular boundary avoids the corner artifacts inherent in square bounding boxes and better approximates a defensible regional catchment area. As shown in Figure 5, this does a good job at capturing the majority of our dataset.

MBTA dataset with histogram and zones
Figure 5. Stops and lines in the MBTA dataset. To show station density for overplotting regions, longitudinal and latitudinal histograms are provided. Points within the 25 km radius South Street Diner (SSD) dataset are shown in yellow. Of the 7,087 MBTA stations, 6,126 (86%) are included in this region. Note that while most of the bus routes are included, some of those at the periphery are not and as such we may get strange network linking issues at these places.
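
The circular analysis window described above can be sketched as a great-circle distance filter. This is a minimal illustration, not the code used to produce Figure 5: the datum coordinates are approximate values I have assumed for the diner, and the `lat`/`lon` keys and example records are hypothetical stand-ins for the GeoJSON point geometry.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# Assumed, approximate coordinates for the datum; not taken from the dataset.
DATUM_LAT, DATUM_LON = 42.3499, -71.0577
RADIUS_KM = 25.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_window(stops, radius_km=RADIUS_KM):
    """Keep only stops whose distance from the datum is at most radius_km."""
    return [
        s for s in stops
        if haversine_km(DATUM_LAT, DATUM_LON, s["lat"], s["lon"]) <= radius_km
    ]


# Hypothetical records; the real export stores coordinates in GeoJSON geometry.
stops = [
    {"stop_name": "downtown example", "lat": 42.3520, "lon": -71.0550},
    {"stop_name": "far-south example", "lat": 41.8290, "lon": -71.4130},
]
kept = within_window(stops)
share = len(kept) / len(stops)  # fraction of stops retained
```

Applied to the full stop list, `share` would correspond to the retained fraction quoted in the Figure 5 caption.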

Now that the geographic region of interest has been defined, we consider the distribution of service types. Q0.2: What is the most common transport mode when measured by number of stations? Figure 6 shows all stops categorized by service type. We observe a plausible mix of bus, rapid transit, and commuter rail stops and A0.2: the most common transport mode is the local bus. However, the dataset also includes a substantial number of Rail Replacement Bus and Supplemental Bus stops. While these are correctly included in the feed, they represent temporary service patterns rather than permanent infrastructure. To avoid inflating mode counts with transient services, these categories are removed from subsequent analysis.

The figure also reveals a surprisingly large number of bus stops (that's why we call it the Mostly Bus Transit Authority). Exploring the map in detail (as in the Quincy inset of Figure 7) we see that stops across the street from each other are being counted separately. While this may be appropriate for operational purposes (e.g., signage or traffic management), metre-scale separation between stops is not necessarily meaningful for network-level analysis. Initially, I investigated whether this consolidation was encoded via the parent_station field and plotted this in Figure 7. However, relatively few stops are structured in this way, implying we should do some additional processing.

Types of service distribution
Figure 6. Distribution of stops by service category within 25 km of the South Street Diner datum. Similar categories are shown in similar colours and can be used for additional characterisation later (e.g., in the absence of quantitative frequency data, we may reasonably assume that a “Coverage Bus” has lower service frequency than a “Frequent Bus”). This plot functions as a categorical sanity check: a plausible mix of bus, rapid transit, and commuter rail stops suggests the feed is broadly complete.
Parent station encoding map
Figure 7. Station distribution map by type. Stops that contain a parent_station field are highlighted in red. Several instances of spatially adjacent stops (including bus-to-rail transfers) are not encoded as belonging to a shared station, suggesting that additional spatial filtering is required.
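
The category filter and mode count behind Figure 6 can be sketched with a simple tally. This is an illustrative example under stated assumptions: the stop records are hypothetical, and only the `route_desc` field and the two temporary category names come from the text.

```python
from collections import Counter

# Categories treated as temporary service patterns in the text.
TEMPORARY = {"Rail Replacement Bus", "Supplemental Bus"}

# Hypothetical stop records using the route_desc field from Table 1.
stops = [
    {"stop_id": "s1", "route_desc": "Local Bus"},
    {"stop_id": "s2", "route_desc": "Local Bus"},
    {"stop_id": "s3", "route_desc": "Frequent Bus"},
    {"stop_id": "s4", "route_desc": "Rapid Transit"},
    {"stop_id": "s5", "route_desc": "Rail Replacement Bus"},
    {"stop_id": "s6", "route_desc": "Supplemental Bus"},
]

# Drop transient services, then count the remaining stops per category.
permanent = [s for s in stops if s["route_desc"] not in TEMPORARY]
mode_counts = Counter(s["route_desc"] for s in permanent)
most_common_mode, most_common_count = mode_counts.most_common(1)[0]
```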

To begin our processing, Figure 8 shows the distribution of distances between adjacent stops. Many stops lie within a few tens of metres of one another, strongly suggesting that they represent components of a single functional station (e.g., opposing bus bays or bus-to-rail transfer interfaces). Based on this spatial clustering, I adopt a 100 m consolidation threshold. This approximately corresponds to typical downtown block spacing and represents a conservative walking-transfer distance. Stops located within 100 m of one another are merged into a single station entity. Where multiple service types are present, station type is assigned according to modal hierarchy (Rapid Transit > Regional Rail > Ferry > Bus). The resulting consolidated station map and distribution are shown in Figures 9 and 10. In total, 2,197 clusters of two or more stops (4,770 stops in all) were merged, reducing the original 6,049 stops to 3,476 stations.

Station proximity histogram
Figure 8. Distance between adjacent stops. A large number of stops are separated by very small distances, indicating likely station clustering.
Merged station distribution
Figure 9. Station distribution after merging stops located within 100 m of one another. Consolidation substantially reduces redundant bus stops.
Final merged station map
Figure 10. Final consolidated station map. Merged stations are highlighted in red. The majority of the network contains clustered stops, consistent with expected multi-modal transfer environments.
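
One way to sketch the 100 m consolidation step is with a union-find over all stop pairs. The text does not specify the clustering algorithm, so this is an assumption: single-linkage grouping (which can chain stops together), a centroid as the merged location, and a hypothetical `mode` field that lumps all bus categories into "Bus".

```python
import math

# Modal hierarchy from the text, highest priority first.
HIERARCHY = ["Rapid Transit", "Regional Rail", "Ferry", "Bus"]
THRESHOLD_M = 100.0
EARTH_RADIUS_M = 6371008.8


def distance_m(a, b):
    """Approximate distance in metres (equirectangular; adequate at ~100 m)."""
    mean_lat = math.radians((a["lat"] + b["lat"]) / 2)
    dx = math.radians(b["lon"] - a["lon"]) * math.cos(mean_lat)
    dy = math.radians(b["lat"] - a["lat"])
    return EARTH_RADIUS_M * math.hypot(dx, dy)


def merge_stops(stops, threshold_m=THRESHOLD_M):
    """Group stops linked by chains of <= threshold_m hops into stations."""
    parent = list(range(len(stops)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # O(n^2) pairwise check; a spatial index would scale better.
    for i in range(len(stops)):
        for j in range(i + 1, len(stops)):
            if distance_m(stops[i], stops[j]) <= threshold_m:
                parent[find(i)] = find(j)

    clusters = {}
    for i, stop in enumerate(stops):
        clusters.setdefault(find(i), []).append(stop)

    stations = []
    for members in clusters.values():
        stations.append({
            "lat": sum(m["lat"] for m in members) / len(members),
            "lon": sum(m["lon"] for m in members) / len(members),
            # Highest-priority mode present in the cluster wins.
            "mode": min((m["mode"] for m in members), key=HIERARCHY.index),
            "members": [m["stop_id"] for m in members],
        })
    return stations


# Hypothetical stops: two about 60 m apart, one roughly 1 km away.
stops = [
    {"stop_id": "bus_a", "lat": 42.3650, "lon": -71.1030, "mode": "Bus"},
    {"stop_id": "rail_a", "lat": 42.3654, "lon": -71.1035, "mode": "Rapid Transit"},
    {"stop_id": "bus_b", "lat": 42.3750, "lon": -71.1030, "mode": "Bus"},
]
stations = merge_stops(stops)
```

Here the first two stops collapse into a single Rapid Transit station while the third remains a standalone bus stop.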

Finally, to ensure that station consolidation did not distort network structure and to answer the question of Q0.3: Where are the most interconnected stations?, I evaluated stop connectivity by computing each node’s degree within the undirected route graph (Figure 11). In a coherent transit network, almost no nodes should have degree zero; terminal stops should cluster at degree one; intermediate stops at degree two; and major transfer hubs should form a long tail at higher degrees. The observed distribution follows this expected pattern, suggesting that the merging procedure preserved network topology. From this we can determine that A0.3: the top five hub locations are South Station, Quincy Center, Government Center, Ruggles, and Porter Square. This is roughly what I expected given how frequently I pass through them with Commuter Rail, T-stop, and bus line connections.

Connectivity degree distribution
Figure 11. Connectivity (degree) distribution across stops in the route network, computed as the number of distinct neighbouring stops connected by route segments. The expected pattern is concentration at low degrees with a long tail corresponding to transfer hubs; this functions as a topology check prior to accessibility analysis. Map included as a sanity check and for my own interest.
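
The degree check can be sketched by building an undirected adjacency map from ordered stop sequences. The route names and stop ids below are hypothetical, and I am assuming each route is available as an ordered list of stops; the sketch uses plain dictionaries rather than a graph library.

```python
from collections import Counter, defaultdict

# Hypothetical routes, each an ordered sequence of stop ids.
routes = {
    "Line A": ["a1", "hub", "a2"],
    "Line B": ["b1", "hub", "b2"],
    "Bus 1": ["a2", "b2"],
}


def build_adjacency(routes):
    """Undirected adjacency: consecutive stops on any route are neighbours."""
    neighbours = defaultdict(set)
    for sequence in routes.values():
        for u, v in zip(sequence, sequence[1:]):
            if u != v:
                neighbours[u].add(v)
                neighbours[v].add(u)
    return neighbours


adjacency = build_adjacency(routes)
# Degree = number of distinct neighbouring stops, as in Figure 11.
degree = {stop: len(nbrs) for stop, nbrs in adjacency.items()}
degree_histogram = Counter(degree.values())
# Highest-degree stops first; ties broken alphabetically.
top_hubs = sorted(degree, key=lambda s: (-degree[s], s))[:5]
```

In this toy network the termini have degree one, the intermediate stops degree two, and the single transfer point sits alone in the tail, which is the shape the topology check looks for.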

Overall, the dataset appears internally consistent and geographically plausible, with no immediate red flags such as coordinate corruption, missing categories, or disconnected subgraphs. This robustness is likely aided by prior preprocessing conducted by Fiorista. That said, the dataset is necessary but not sufficient for a full transit-oriented development (TOD) evaluation. Answering higher-level questions will eventually require integration of ridership data, reliability metrics, and land-use or demographic indicators (e.g., employment density, housing supply, and zoning characteristics). For now, however, we proceed using the available data to address preliminary analytical questions.

4.0 Discoveries and insights

I considered each of our six major questions in turn, exploring what the dataset could tell us on each.

Q1.0: Do higher density areas near stations exhibit higher ridership per capita?

At this stage we don't have sufficient data to answer this question. Our route information does not give us ridership numbers and we don't have a population density distribution. However, it does inspire me to ask a follow-up question: Q1.1: What areas have the greatest public transportation density? We went some way toward answering this in Figure 5, where we were able to observe a dense inner core. However, that used the full dataset (i.e., beyond the South Street Diner 25 km zone, and with the repeated bus stops). Let's repeat this with our filtered dataset. Since I will be producing the rotated histograms again, let's use a 35 km x 35 km "square peg in a round hole" (roughly the largest square that fits inside the 50 km diameter circle) so that we don't unfairly bias the centre region. Since people care about nearby stations rather than the lines, we'll just consider stops. This is shown in Figure 12. There is a clear westward bias to the data, which makes sense given that the East contains Boston Harbour. Surprisingly there is little difference between North and South, suggesting an even spread. This figure is alright; we have minimized overplotting and helped to draw attention to the cardinal directions of greatest development. Finding the modal point in latitude and longitude points us toward Malden having the greatest station density, but this inference is hard to make and doesn't take into account the type of transit (e.g., bus vs train). Even so, we can still answer our question A1.1: Boston's North-West has the greatest overall station density.

Station density modal map
Figure 12. Station density for the MBTA network. Rotated histograms illustrate the number of stations as a function of distance. North-South the network is roughly evenly distributed, while East-West it is more clustered just west of the datum (i.e., between Boston and Cambridge). Rapid Transit spreads mostly eastward from the CBD while Ferry terminals are heavily clustered around the Waterfront. Due to plotting (and time) limitations the base map only includes rough geography; I'd have preferred to use a true map (similar to Figure 10).

The limitations of Figure 12 inspired me to consider a new question, Q1.2: Which areas are closest to good public transportation? Now we get to include some qualitative assessment of what being close to public transport means. I'll do network analysis in a later question, but for now let's consider walking distance to useful stops. I'll consider faster forms of transit as more useful. Based on my experience, "Rapid" Transit tends to move at around 30 km/h (faster for the Red Line and slower for the Green Line), Regional Rail at 60 km/h (technically it's allowed to go much faster, but there are stops), Ferry at 20 km/h, and Bus at 15 km/h. I'll define an arbitrary goodness function that takes into account speed of service and distance to station as:

$$ \text{score} = \sum_{\text{nearest three unique service types}} \frac{\text{speed of service}}{(\text{distance to closest service})^2} $$

Initially I summed this over all service types, but this tended to unfairly bias against locations further away from ferry terminals. Therefore I restricted the sum to the nearest services. I completed this calculation at 100 m x 100 m intervals across our 35 km x 35 km zone. In Figure 13 we now see that the best locations tend to cluster in a spoke pattern radiating from the CBD. Even with the speed-related penalty, buses do much of the heavy lifting when it comes to making an area accessible to public transit, which we can see in the strong correlation between the initial bus-only map and the combined multi-transit summation. Interestingly, this view also helps to reveal transit deserts around the Jamaica Pond and Lexington regions (as well as the harbour, which for obvious reasons we don't care about), which could be helpful for subsequent analysis. To improve this further I'd like to consider where people are getting to and from by following the more popular routes (e.g., easy access to the CBD is much more important than easy access to Harvard), but this will need to wait for future analysis. For the moment we can be content in answering A1.2: the areas along bus routes and the rapid transit hubs are the closest to good public transportation.

Station proximity by type
Figure 13. Station proximity by type for the MBTA network. Staggered opacities show the bottom 50% (transparent) and top 50-60%, 60-70%, 70%-80%, 80%-90%, and 90%-100% (fully opaque) locations.
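
The goodness function can be sketched for a single grid location as follows. The speeds and the "nearest three unique service types" rule come from the text; the planar kilometre coordinates, the `x_km`/`y_km`/`mode` keys, the choice of kilometres as the distance unit, and the 50 m distance floor (to avoid dividing by zero at a station) are my assumptions.

```python
import math

# Service speeds (km/h) from the text.
SPEED_KMH = {"Rapid Transit": 30.0, "Regional Rail": 60.0, "Ferry": 20.0, "Bus": 15.0}
MIN_DISTANCE_KM = 0.05  # assumed floor (half a 100 m cell) to avoid division by zero


def goodness(x_km, y_km, stops, n_types=3):
    """Sum speed / distance^2 over the n_types nearest distinct service types."""
    nearest = {}  # service type -> distance to its closest stop
    for stop in stops:
        d = math.hypot(stop["x_km"] - x_km, stop["y_km"] - y_km)
        if stop["mode"] not in nearest or d < nearest[stop["mode"]]:
            nearest[stop["mode"]] = d
    closest = sorted(nearest.items(), key=lambda item: item[1])[:n_types]
    return sum(SPEED_KMH[mode] / max(d, MIN_DISTANCE_KM) ** 2 for mode, d in closest)


# Hypothetical stops in a local planar frame (km from an arbitrary origin).
stops = [
    {"mode": "Bus", "x_km": 0.1, "y_km": 0.0},
    {"mode": "Rapid Transit", "x_km": 0.0, "y_km": 0.5},
    {"mode": "Regional Rail", "x_km": 1.0, "y_km": 0.0},
    {"mode": "Ferry", "x_km": 0.0, "y_km": 5.0},
]
score = goodness(0.0, 0.0, stops)
```

At the origin the ferry is the fourth-nearest service type and is dropped, so the score is 15/0.1² + 30/0.5² + 60/1² = 1680. The nearby bus stop dominates, which is consistent with the observation that buses do much of the heavy lifting.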

Q2.0: What factors influence station usage?

While we don't have the statistics to directly infer station usage, we can infer that stations within walking distance are more likely to be used. Q2.1: Where are T-stops within walking distance? I typically find walking time more useful than distance (e.g., if it takes 5 minutes to walk to class, I can leave at 1pm and still get there for 1:05pm, on time for MIT-time!) so I gave this figure in terms of time. Assuming a walking speed of 5 km/h I produced a plot of walking time to the nearest Rapid Transit stop. As I've been producing these figures I've appreciated the ability to zoom into my local region, and I attempted to replicate this by embedding my own interactive map online in Figure 14. During this analysis I found a slight misalignment between the coordinates being used and the underlying OpenStreetMap layer. From the map we can see that A2.1: Downtown Boston, Brookline, Cambridge, and parts of Cedar Grove are all accessible, with hubs in Revere Beach, Malden, and Braintree. Meanwhile, much of the rest, like Watertown, Oak Hill, and Milton, is inaccessible. Without driving a car or taking the bus to the station, station usage there would likely be low.

Figure 14. Walking time to the nearest Rapid Transit station. Times are classified as low (less than 5 min, green), medium (less than 15 min, orange), high (less than 30 min, red), with those outside this range considered practically inaccessible and therefore not plotted. Colours were chosen to reflect the traffic-light convention of “green for go” and “orange for warning”. I elected to use just three colours rather than a rainbow scale to make interpretation easier.
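
The walking-time bands in Figure 14 follow directly from the 5 km/h assumption. A minimal sketch of the conversion and classification, with function names that are purely illustrative:

```python
WALK_SPEED_KMH = 5.0  # walking speed assumed in the text


def walking_minutes(distance_km, speed_kmh=WALK_SPEED_KMH):
    """Convert a walking distance in kilometres to minutes."""
    return distance_km / speed_kmh * 60.0


def classify(minutes):
    """Band a walking time as in Figure 14; None means 'not plotted'."""
    if minutes < 5:
        return "low"
    if minutes < 15:
        return "medium"
    if minutes < 30:
        return "high"
    return None


# Distance thresholds implied by each band at 5 km/h.
band_limits_km = {label: limit / 60.0 * WALK_SPEED_KMH
                  for label, limit in (("low", 5), ("medium", 15), ("high", 30))}
```

At this speed the three bands end at roughly 0.42 km, 1.25 km, and 2.5 km from a station.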

While this pre-empts some of the analysis in Question 4, we can also infer that a more interconnected network is more likely to be used. So, Q2.1: what does the MBTA network look like? We've already looked at this from a geographic perspective (stops on a map) but I'm interested in what it feels like as a commuter with different lines linking together. Since I had already done some network analysis to determine the statistics in Figure 11, I decided that with Figure 15 I'd map just the train-based options. In previous questions I used a standard colour scheme to separate the transit types. Since we're now concerned only with the Rapid (the "T") and Regional (Commuter Rail) transit options, which have unique and identifiable colour schemes defined both in the dataset and in public consciousness, I'll base my colours on these. As we can see in Figure 15, A2.1: the MBTA looks like a hub-and-spoke network, with the Boston CBD as the hub and Salem, Lowell, Leominster, Newton, Walpole, and Brockton as spokes radiating from this core.

Station proximity by type
Figure 15. MBTA lines and stops in the Rapid and Regional transit networks. Node stations where multiple lines converge are highlighted with large grey dots. Initially I plotted this on our recurring map, but this made the actual lines hard to read. While attempting to make the backdrop white I typed in #000000h rather than #FFFFFFh, resulting in a black background that actually made the lines (especially the "T") much easier to see. To this geographic layout I added arrows illustrating how the 'spokes' bring travellers into the central 'hub' of the city. I inverted the usual "hub and spoke" phrase to "spoke and hub" so that the viewer could better follow travellers being brought into the city.

Given our hub-and-spoke model, an immediate question is Q2.2: How do you move between adjacent spokes without passing through the hub? The answer is of course buses! And of course lots of traffic. I filtered for bus routes that passed between adjacent lines and plotted just those. After some very busy graphs I restricted analysis to just the Rapid Transit routes. Sick of map-based plotting and inspired by the clean network layout of Harry Beck's original 1933 Tube map (Figure 1), I set about showing the network as a true network. Ideally I'd follow in Beck's footsteps and use the techniques of electrical circuit layout, but with limited time for hand-assembled graphics I turned to the NetworkX Python package to help with the layout, eventually opting for a web-like spoke model in Figure 16. While we can't identify individual bus stations, we can clearly see the importance of buses. A2.2: Buses connect the spokes of the Boston MBTA together, allowing for transit without passing through the hub. This figure was a good start and very clearly communicates our solution; however, I would have liked to place the lines in a more logical way so that we could reflect real-world structures (i.e., South Station hub, Orange Line bisected).

Line connection web
Figure 16. Bus connections between Rapid Transit spokes, showing the major lines. The graphic "T" signifies that we're examining just the Rapid Transit network while also including some much needed balance to the design.
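
The filter for bus routes that link different Rapid Transit lines can be sketched as a set operation. Everything here is hypothetical illustration: the station and route names are invented, and the sketch links any pair of lines a bus touches rather than testing whether the lines are adjacent or whether the bus passes through the hub.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mapping of rapid transit stations to the lines serving them.
station_lines = {"r_stn": {"Red"}, "o_stn": {"Orange"}, "g_stn": {"Green"}}

# Hypothetical bus routes as ordered stop lists.
bus_routes = {
    "bus_x": ["r_stn", "street_1", "o_stn"],
    "bus_y": ["g_stn", "street_2"],
    "bus_z": ["o_stn", "g_stn", "r_stn"],
}


def spoke_connectors(bus_routes, station_lines):
    """Map each pair of lines to the bus routes that stop on both."""
    links = defaultdict(set)
    for route, stops in bus_routes.items():
        lines = set()
        for stop in stops:
            lines |= station_lines.get(stop, set())
        for pair in combinations(sorted(lines), 2):
            links[pair].add(route)
    return dict(links)


connectors = spoke_connectors(bus_routes, station_lines)
# Number of bus routes calling at each rapid transit station.
bus_lines_per_station = {
    station: sum(station in stops for stops in bus_routes.values())
    for station in station_lines
}
```

The `connectors` mapping is the kind of edge list one could hand to a graph layout tool, while `bus_lines_per_station` is the quantity a bubble chart of bus connections would encode as size.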

This then inspired a new follow-up question, Q2.3: what's the most popular Rapid Transit station for bus connections? I exported the station name, number of bus lines, color, and location data from my Python network to Tableau to produce the bubble chart in Figure 17 and the location-informed circles of Figure 18. These make it apparent that A2.3: the most popular stations are Congress Street @ North Station, Beacon Street @ Raleigh Street, and Brookline Ave @ Newbury Street, with popular stations clustering in the CBD and along northerly, southerly, and westerly spokes.

Station proximity histogram
Figure 17. MBTA stations on Rapid Transit Lines by number of bus lines bubble chart. Size (and redundant numbering) encodes information about the number of connected bus lines, while colour indicates the associated line with white being shared hub stations. Initially I filtered for stations with more than 15 connections but appreciated the community-like cluster we get from including all the stations. I tried to arrange bubbles by geography (e.g., stacking more northerly stations to the north) but couldn't get this feature to run on Tableau (they have a very locked-down bubble sort algorithm) and I would need to program my own sorter to get this to work. Figure 18 is an indirect result of this location-based experimentation.
Merged station distribution
Figure 18. MBTA stations on Rapid Transit Lines by number of bus lines, plotted by location. To reduce overplotting, only stations with more than 15 connecting buses were included. This did not prevent all overplotting, and the view is of limited use in the CBD and along the Green Line. I included this anyway as a demonstration of Tableau's capabilities and (in particular) the perception-based scale for circle area.

Q3.0: Which stations are good candidates for TOD?

As previously discussed, the World Bank identifies node value (interconnectedness with the network), place value (attractiveness of the area around the station), and market potential value (future demand for land with respect to jobs and housing) as key dimensions to consider for TOD. We have already made a good start at quantifying node value in previous questions. However, place value and market potential value are well outside the scope of our dataset. Therefore we will skip this question for this assignment and turn our attention to other areas.

Q4.0: How interconnected is Boston, and how does this change for people with limited mobility?

Considering the question of interconnectedness I'm reminded of the fact that all roads must lead to the South Street Diner and that, in general, we would expect more connected places in Boston to have an easier time getting to any location. Thus one way of considering our question is Q4.1: How long does it take to get to South Street Diner? I split our map into a 100 m x 100 m grid and used a fill algorithm that starts at the Diner and propagates the minimum travel time to each adjacent tile. I reused my transit estimates of Rapid Transit at 30 km/h, Regional Rail at 60 km/h, Ferry at 20 km/h, Bus at 15 km/h, and walking at 5 km/h. To take network effects into account, travellers can only change from one mode to another at station locations and must reach any given cell on foot. Wanting to encode connectivity using the medium of speed, I animated the network growth from the SSD starting location, tracking how far we could go in 30 minutes (Figure 19). For how long it takes to get to the Diner, A4.1: it depends where you are, but less than 30 minutes if you live on the spokes! I think some of my connections and timings need to be adjusted slightly, but it's a good first attempt.

Travel time animation
Figure 19. Animated connectivity from South Street Diner. Motion and time encode the idea of how connected places are and highlight features of where different stations are located.
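
The fill described above behaves like a shortest-path search over grid cells. The sketch below uses Dijkstra's algorithm as a stand-in, which is my assumption about a reasonable implementation rather than the code behind the animation. It also simplifies: walking is restricted to four-neighbour moves, waiting times are ignored, and the single transit link and strip dimensions are hypothetical.

```python
import heapq
import math

CELL_KM = 0.1   # 100 m grid cells, as in the text
WALK_KMH = 5.0  # walking speed from the text


def transit_link(a, b, speed_kmh):
    """Two-way link between station cells a and b; ignores waiting time."""
    dist_km = math.hypot(a[0] - b[0], a[1] - b[1]) * CELL_KM
    minutes = dist_km / speed_kmh * 60.0
    return {a: [(b, minutes)], b: [(a, minutes)]}


def travel_times(width, height, start, links):
    """Dijkstra over grid cells; returns {cell: minutes from start}."""
    walk_step = CELL_KM / WALK_KMH * 60.0  # minutes to walk one cell
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        t, cell = heapq.heappop(queue)
        if t > best[cell]:
            continue  # stale queue entry
        x, y = cell
        moves = [((x + dx, y + dy), walk_step)
                 for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        moves += links.get(cell, [])  # mode changes only at station cells
        for nxt, cost in moves:
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if t + cost < best.get(nxt, math.inf):
                best[nxt] = t + cost
                heapq.heappush(queue, (t + cost, nxt))
    return best


# Hypothetical 4 km x 0.5 km strip with one Rapid Transit link (30 km/h).
links = transit_link((0, 0), (30, 0), speed_kmh=30.0)
times = travel_times(width=40, height=5, start=(0, 0), links=links)
reachable_30 = {cell for cell, t in times.items() if t <= 30.0}
```

The cell 3 km away is reached in 6 minutes by train instead of 36 minutes on foot, and the cells around it inherit that head start, which produces the "islands" of reachability around stations seen in the animation.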

Considering the limited mobility component, Q4.2: How interconnected is Boston for those with limited mobility as compared to those with full mobility? I wanted to provide a resource with which one could immediately assess a region's accessibility. Reminded of the Figure 14 interactive plot, I returned to it, this time taking into account a reduced mobility speed while also checking the stations' accessibility markers. This updated plot is shown in Figure 20. Encouragingly, it appears nearly all of the stations are accessible, although this does make the contrast between the plots less interesting. A single “inaccessible” station can be seen at the Packard’s Corner light rail Green Line stop. Initially I thought that this was a database issue, but I went past the station before class and confirmed that getting up to the light rail car (or even to the very small platform) would be a challenge with limited mobility. (The Google reviews for the platform are also quite bad, with an ominous “rat infestation at night” that is equal parts horrifying and amusing.) A4.2: Boston has slightly restricted access for those with limited mobility, but as a result of slower pace of movement rather than fewer accessible stations.

Figure 20. Walking time to the nearest Rapid Transit station with full and limited mobility, changeable using a toggle. Times are classified as low (less than 5 min, green), medium (less than 15 min, orange), high (less than 30 min, red), with those outside this range considered practically inaccessible and therefore not plotted.
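
The full versus limited mobility toggle can be sketched as two calls to the same nearest-station function. The text does not state the reduced speed, so the 3 km/h figure is purely an assumed placeholder; the planar coordinates and station records are hypothetical, while `wheelchair_boarding == 1` meaning accessible follows Table 1.

```python
import math

FULL_SPEED_KMH = 5.0     # walking speed from the text
LIMITED_SPEED_KMH = 3.0  # assumed placeholder; the text gives no value


def nearest_station_minutes(x_km, y_km, stations, speed_kmh, accessible_only=False):
    """Minutes to the closest station, optionally only accessible ones."""
    candidates = [
        s for s in stations
        if not accessible_only or s["wheelchair_boarding"] == 1
    ]
    if not candidates:
        return None  # no usable station at all
    distance_km = min(math.hypot(s["x_km"] - x_km, s["y_km"] - y_km) for s in candidates)
    return distance_km / speed_kmh * 60.0


# Hypothetical stations in a local planar frame (km).
stations = [
    {"name": "near, not accessible", "x_km": 0.5, "y_km": 0.0, "wheelchair_boarding": 2},
    {"name": "far, accessible", "x_km": 1.0, "y_km": 0.0, "wheelchair_boarding": 1},
]
full = nearest_station_minutes(0.0, 0.0, stations, FULL_SPEED_KMH)
limited = nearest_station_minutes(0.0, 0.0, stations, LIMITED_SPEED_KMH, accessible_only=True)
```

In this toy case the limited-mobility time rises from 6 to 20 minutes, partly from the slower pace and partly from skipping the inaccessible station. With nearly all real stations marked accessible, the slower pace would account for most of the difference.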

Q5.0: How do neighborhood indicators change after TOD adoption, and do those changes align with common community fears (e.g., traffic, density, character)?

As with Q3, there is insufficient data through which to answer this question. We have no indication of neighborhood indicators within the dataset and it includes no historical data or indication of TOD features.

Q6.0: How alive is Boston?

I had several ideas for this question, and indeed Figure 19 already looks like a city breathing in and out. Since we were limited in available data, I wanted to find a way of transcribing the mathematical data of nodes and lines into something more human — specifically music. Q6.1: What does the T sound like? I followed a train starting at Alewife and finishing at Braintree following the Red Line route. Each measure represents one station, with a descending scale for each stop and additional notes representing the major bus connections — specifically those which go to adjacent lines. With more time (and ideally a friend from Berklee), this would probably sound pretty interesting, especially when bringing in additional lines, new instruments, and refrains as one line repeats and then splits (e.g., Alewife to JFK to Ashmont; Alewife to JFK to Braintree). A6.1: The T sounds like retro arcade music played on a low-bit piano. There are periods of calm and periods of chaos.

Figure 21. The Red Line rendered as music. Each measure represents one station, with shorter notes indicating connecting bus routes. The pitch descends moving south from Alewife to Braintree. In the future pairing this with (neatly drawn) sheet music and the actual geography would help to bring across the concepts. For the moment though I just wanted to focus on one unique data channel.
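
One possible way to sketch the station-to-music mapping is to assign each station a measure whose main pitch steps down a scale, followed by short notes for bus connections. The specifics here are all my assumptions rather than the mapping actually used: a C major scale starting at MIDI note 84, half-beat bus notes a fifth above the station pitch, a cap of four bus notes per 4/4 measure, and placeholder bus connection counts.

```python
C_MAJOR_STEPS = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets within one octave


def descending_pitch(index, top_midi=84):
    """MIDI pitch for station `index`, walking down C major from top_midi."""
    octaves_down = (index + 6) // 7
    degree = (-index) % 7
    return top_midi - 12 * octaves_down + C_MAJOR_STEPS[degree]


def measure_for(index, n_bus_connections, beats_per_measure=4.0):
    """One measure: a long station note, then a short note per bus connection."""
    pitch = descending_pitch(index)
    short = min(n_bus_connections, 4)  # assumed cap so the measure stays in time
    notes = [(pitch, beats_per_measure - 0.5 * short)]
    notes += [(pitch + 7, 0.5)] * short  # bus notes a fifth above (assumed)
    return notes


# First Red Line stations heading south; bus counts are placeholders.
stations = [("Alewife", 3), ("Davis", 2), ("Porter", 1)]
score = [measure_for(i, buses) for i, (_, buses) in enumerate(stations)]
```

Each entry in `score` is a list of (MIDI pitch, duration in beats) pairs that could be passed to any MIDI or audio library for playback.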

5.0 Summary

In this exploratory analysis I investigated Transit-Oriented Development (TOD) in Greater Boston using the MBTA-derived stop and route geometries. I began by validating the dataset spatially (stops, lines, and buffer polygons) to confirm geographic plausibility and to identify structural issues that would affect interpretation, particularly heavy overplotting in the urban core and the presence of temporary service categories (e.g., rail replacement and supplemental bus). To keep comparisons meaningful, I defined a regional analysis window (25 km radius) around South Street Diner. A major transformation step was consolidating nearby stops into functional stations: many bus stops are separated only by tens of meters (often opposite-direction bays), and these duplicates dominate naive station counts. Using a 100 m clustering threshold and a simple modal hierarchy to assign station type I reduced redundant nodes while preserving network structure.

With this cleaned representation, I explored several questions about access and connectivity. Although the dataset cannot directly answer ridership-per-capita or neighborhood change questions, it supports useful proxy analyses. Station density highlights a core-and-spoke pattern woven together by bus lines: high-accessibility corridors radiate from the CBD, while transit deserts appear in several peripheral pockets. For rapid transit specifically, walking-time maps make the geographic limits of the subway network legible and suggest where low station proximity likely suppresses usage.

Overall, this work establishes a defensible cleaned station graph from a geometry-heavy feed, initial spatial and network-based measures of access and interconnectedness, and concrete next steps for a full TOD evaluation. We need to integrate ridership, service frequency, and land-use indicators to move from proximity-based proxies to outcome-based measures of TOD performance.

6.0 Bonus content

Well done making it to the end of this assignment (or just scrolling down here). This was a lot of work. Part of how I kept things manageable was the promise of visiting South Street Diner upon submission (Bonus Figure 1, photographed immediately before I had to run off to San Francisco for a week; the assignment was submitted while in the airport). I also made some highly suspect memes (Bonus Figures 2–3) which, as I inflicted upon my flatmates, I will now inflict upon you. Until next time, James.

South Street Diner
Bonus Figure 1. A long expected pancake!
Why so many buses?
Bonus Figure 2. Why are there so many buses?
Spider train
Bonus Figure 3. I'm sorry. I saw trains, I saw a spider web, I needed a Red Line spider-train.