Analyzing NYC Motor Vehicle Collisions

Project 1

Wuhao Xia

In this project, we’ll perform a basic statistical analysis on a dataset of motor vehicle collisions in New York City. The goal is to practice data manipulation and analysis using two different methods:

The Easy Way: Using the powerful pandas library.

The Hard Way: Using only the Python standard library, with no external packages.

Finally, we’ll create a simple text-based visualization of our findings, again using only the standard library.

The Dataset

The dataset is “Motor Vehicle Collisions - Crashes” from the NYC Open Data portal (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data).

The full dataset contains detailed records of motor vehicle collisions from 2012 to the present. For this notebook, we use only the records from January 1 to November 1, 2025.

Source: NYC Open Data Portal

File Used: Motor_Vehicle_Collisions_-_Crashes_20251101.csv

For this analysis, we will focus on a single numeric column: NUMBER OF PERSONS INJURED. This is a key metric for understanding the severity and human impact of these incidents.

import pandas as pd
import csv

# Define our file name and column name as constants. 
FILE_NAME = 'Motor_Vehicle_Collisions_-_Crashes_20251101.csv'
COLUMN_NAME = 'NUMBER OF PERSONS INJURED'

Part 1: The Pandas Way

First, let’s use pandas, which provides built-in methods that make these calculations trivial. We read the CSV into a DataFrame, select our column (a single column is a pandas Series), and call its .mean(), .median(), and .mode() methods. By default, all three skip missing values (NaN).

# Read the data into a pandas DataFrame
collision_df = pd.read_csv(FILE_NAME)

# Select the single column we want to analyze
data_series = collision_df[COLUMN_NAME]

# Compute the statistics
mean_pandas = data_series.mean()
median_pandas = data_series.median()

# .mode() returns a Series (several values can tie for most frequent),
# sorted in ascending order, so [0] picks the smallest of the modes
mode_pandas = data_series.mode()[0]

print(f"Mean:   {mean_pandas:.4f}")
print(f"Median: {median_pandas}")
print(f"Mode:   {mode_pandas}")

Mean:   0.5837
Median: 0.0
Mode:   0

Part 2: The Hard Way (Standard Library Only)

Now we will repeat the exact same calculations without pandas or any other external library.

The process will be:

Open and read the CSV file using the csv module.

Find the correct column index from the header row.

Iterate through every row, extract the value from our target column, and convert it to an integer.

We must add error handling (try-except) for this conversion, as some values might be missing or non-numeric (‘’, NA, etc.).

Store all valid numbers in a list.

Manually implement the algorithms for mean, median, and mode using this list.
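The reading and conversion steps above can be sketched on a small in-memory sample before touching the real file. This is a minimal illustration, not the notebook's own code: the SAMPLE_CSV text and the read_int_column helper are hypothetical names introduced here, and the blank and NA cells are made-up stand-ins for the kind of bad values the loop has to survive.

```python
import csv
import io

# Hypothetical sample data (not taken from the real file); the empty and "NA"
# cells mimic missing or non-numeric values.
SAMPLE_CSV = """CRASH DATE,NUMBER OF PERSONS INJURED
01/01/2025,0
01/02/2025,2
01/03/2025,
01/04/2025,NA
01/05/2025,1
"""

def read_int_column(lines, column_name):
    """Return (valid_values, skipped_count) for one column of CSV input."""
    reader = csv.reader(lines)
    header = next(reader)                   # first row holds the column names
    col_index = header.index(column_name)   # position of the target column

    values = []
    skipped = 0
    for row in reader:
        try:
            values.append(int(row[col_index]))
        except (ValueError, IndexError):
            # ValueError: '' or 'NA'; IndexError: row shorter than the header
            skipped += 1
    return values, skipped

values, skipped = read_int_column(io.StringIO(SAMPLE_CSV), "NUMBER OF PERSONS INJURED")
print(f"kept {len(values)} values, skipped {skipped} rows")
```

Counting the skipped rows, rather than dropping them silently, makes it easy to see how much data the error handling discarded.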

numbers_list = []

# Read the file and extract data (the csv module recommends newline='')
with open(FILE_NAME, mode='r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)

    # Read the header row
    header = next(reader)

    # Find the column index for our target column
    col_index = header.index(COLUMN_NAME)

    for row in reader:
        try:
            val = int(row[col_index])
        except (ValueError, IndexError):
            # Skip missing or non-numeric values ('', 'NA', short rows)
            continue
        numbers_list.append(val)

# Perform Calculations
count = len(numbers_list)

# Mean
total_sum = sum(numbers_list)
mean_hw = total_sum / count

# Median
sorted_numbers = sorted(numbers_list)
mid_index = count // 2 

if count % 2 == 0:
    # Even number of elements: average of the two middle numbers
    median_hw = (sorted_numbers[mid_index - 1] + sorted_numbers[mid_index]) / 2
else:
    # Odd number of elements: the middle number
    median_hw = sorted_numbers[mid_index]
        
# Mode
value_counts = {}
for num in numbers_list:
    # .get(num, 0) fetches the current count for 'num', or 0 if 'num' isn't in the dict yet
    value_counts[num] = value_counts.get(num, 0) + 1

# After counting, find the number (key) with the highest frequency
mode_hw = max(value_counts, key=value_counts.get)

print(f"Mean:   {mean_hw:.4f}")
print(f"Median: {median_hw}")
print(f"Mode:   {mode_hw}")

Mean:   0.5837
Median: 0.0
Mode:   0
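One way to gain confidence in the hand-written algorithms, still without external packages, is to compare them against the standard library's statistics module on a small list. The sketch below wraps the same logic in a hypothetical helper, mean_median_mode, and uses made-up sample values, not the collision data.

```python
import math
import statistics

def mean_median_mode(numbers):
    """Hand-rolled mean, median, and mode for a non-empty list of numbers."""
    count = len(numbers)
    mean = sum(numbers) / count

    ordered = sorted(numbers)
    mid = count // 2
    if count % 2 == 0:
        # Even count: average the two middle values
        median = (ordered[mid - 1] + ordered[mid]) / 2
    else:
        # Odd count: take the middle value
        median = ordered[mid]

    counts = {}
    for n in numbers:
        counts[n] = counts.get(n, 0) + 1
    mode = max(counts, key=counts.get)
    return mean, median, mode

sample = [0, 0, 1, 2, 0, 1, 5, 0]  # hypothetical values for illustration
mean, median, mode = mean_median_mode(sample)

# Cross-check each result against the standard library
assert math.isclose(mean, statistics.mean(sample))
assert median == statistics.median(sample)
assert mode == statistics.mode(sample)
print(mean, median, mode)
```

The helper assumes a non-empty list; with no valid rows, the division by count would raise ZeroDivisionError.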

Part 3: Data Visualization (Standard Library Only)

For the final step, we’ll create a simple data visualization. The requirement is to not use matplotlib, plotly, or any other plotting library.

A text-based horizontal bar chart is a nice fit. We’ll use the value_counts we just calculated.

Because there are many possible values, the chart will be more readable if we group large values together.

Thus, we will create categories for: 0, 1, 2, 3, 4, and 5+.

We’ll then scale these counts to fit a maximum width and print them using a block character (█).

# 1. Group the data into bins
viz_counts = {
    '0': value_counts.get(0, 0),
    '1': value_counts.get(1, 0),
    '2': value_counts.get(2, 0),
    '3': value_counts.get(3, 0),
    '4': value_counts.get(4, 0),
    '5+': 0
}

# Sum the counts of all values 5 or greater into the '5+' bin
# ('freq' avoids reusing the name 'count' from Part 2)
for num, freq in value_counts.items():
    if num >= 5:
        viz_counts['5+'] += freq

# 2. Find the scaling factor
# Find the largest count to scale the bars
max_count = max(viz_counts.values())
max_bar_width = 50  # Max width of our chart in characters

# Calculate the scale, handling the case of max_count being 0
scale_factor = max_bar_width / max_count if max_count > 0 else 0
scale_unit = (max_count / max_bar_width) if max_count > 0 else 0

# 3. Draw the chart
bar_symbol = "█"
labels = ['0', '1', '2', '3', '4', '5+']

# --- Add the required Title ---
print("\nDistribution of Persons Injured in NYC Collisions")
print("-------------------------------------------------")

for label in labels:
    count = viz_counts[label]
        
    # Calculate bar length based on scale
    bar_length = int(count * scale_factor)
    bar = bar_symbol * bar_length
        
    # Print the line with f-string formatting for alignment
    # {label:<3} = 3-char wide, left-aligned
    # {count:>7,} = 7-char wide, right-aligned, with comma separator
    print(f"Injuries {label:<3} | {count:>7,} | {bar}")

# Print the scale/legend
print(f"\nScale: One '{bar_symbol}' represents ~{scale_unit:,.0f} incidents.")

Distribution of Persons Injured in NYC Collisions
-------------------------------------------------
Injuries 0   |  39,218 | ██████████████████████████████████████████████████
Injuries 1   |  23,758 | ██████████████████████████████
Injuries 2   |   4,324 | █████
Injuries 3   |   1,461 | █
Injuries 4   |     454 | 
Injuries 5+  |     333 | 

Scale: One '█' represents ~784 incidents.
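In the chart above, the bins for 4 and 5+ injuries print no bar at all, because int() truncates any length below one block to zero. One possible variation, shown here only as a sketch, gives every non-empty bin at least one block. The bar_length helper is a hypothetical name; the counts are copied from the output above.

```python
# Bin counts copied from the chart output above
viz_counts = {'0': 39218, '1': 23758, '2': 4324, '3': 1461, '4': 454, '5+': 333}
max_bar_width = 50
max_count = max(viz_counts.values())

def bar_length(count, max_count, width):
    """Scale count to at most `width` blocks; non-empty bins get at least one."""
    if count <= 0 or max_count <= 0:
        return 0
    # Integer floor division matches the truncation of int(count * scale_factor)
    return max(1, count * width // max_count)

lengths = [bar_length(c, max_count, max_bar_width) for c in viz_counts.values()]

for label, count in viz_counts.items():
    bar = "█" * bar_length(count, max_count, max_bar_width)
    print(f"Injuries {label:<3} | {count:>7,} | {bar}")
```

The trade-off is that a single block now overstates the smallest bins, so the legend's "one block ≈ 784 incidents" holds only approximately for them.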