AI Occupational Exposure and Wage Analysis

Project 3

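Dataset(s) to be used:

  • BLS occupational wage data (national_M2023_dl.csv)

  • AIOE data appendix (AIOE_DataAppendix Appendix A.csv)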

Analysis question: Is there a significant correlation between an occupation’s exposure to Artificial Intelligence (as measured by the AIOE score) and its annual mean wage? Specifically, are “high-exposure” jobs associated with higher or lower pay? Furthermore, which specific industries are most vulnerable to (or empowered by) this technology?

Columns that will be used:

  • From AIOE Dataset: SOC Code, AIOE, Occupation Title

  • From BLS Dataset: OCC_CODE, A_MEAN, O_GROUP, OCC_TITLE

Columns to be used to merge/join them:

  • [AIOE Dataset] SOC Code

  • [BLS Dataset] OCC_CODE

Hypothesis: I hypothesize that there is a positive correlation between AI exposure and annual wages. I expect that AI is currently “skill-biased,” meaning it complements complex, high-paying professions (like Tech and Finance) rather than replacing low-wage manual labor.

1. Introduction: The “AI Anxiety” Paradox

In recent years, the rapid advancement of Artificial Intelligence (AI) has sparked a global debate: Will AI take our jobs? Conventional wisdom holds that automation threatens low-skilled labor the most. However, Generative AI (like ChatGPT) seems different: it can write code, analyze laws, and create art. This raises a new question: Is AI actually coming for the white-collar, high-paying jobs?

In this project, I will analyze the relationship between AI Occupational Exposure (AIOE) and Annual Wages. By combining academic research data with official U.S. labor statistics, we will visualize whether AI is a threat to the poor or a tool for the rich.

import plotly.io as pio

pio.renderers.default = "notebook_connected+plotly_mimetype"

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

# Load Data
bls_df = pd.read_csv('national_M2023_dl.csv')
aioe_df = pd.read_csv('AIOE_DataAppendix Appendix A.csv')

print(f"BLS Raw Data Shape: {bls_df.shape}")
print(f"AIOE Raw Data Shape: {aioe_df.shape}")
BLS Raw Data Shape: (1403, 32)
AIOE Raw Data Shape: (774, 3)

2. Data Cleaning and Preparation

Before analysis, we must align two very different datasets. The BLS data requires significant filtering because it contains aggregate rows (like “All Occupations”) that would distort our analysis if treated as single data points.

Key Steps:

  1. Filter for Detail: We only keep rows where O_GROUP is ‘detailed’.

  2. String Sanitization: We strip whitespace from the OCC_CODE and SOC Code columns to ensure they match perfectly during the merge.

  3. Type Conversion: Wage data often comes as strings (e.g., “120,000” or “*”); we convert these to numeric values, treating non-numeric symbols as missing data.

# Clean BLS Data
bls_df.columns = bls_df.columns.str.strip()

# Filter for detailed occupations only
bls_detailed = bls_df[bls_df['O_GROUP'].astype(str).str.strip() == 'detailed'].copy()

# Clean Wage Column: Remove commas and convert to numeric
bls_detailed['A_MEAN'] = bls_detailed['A_MEAN'].astype(str).str.replace(',', '', regex=False)
bls_detailed['A_MEAN'] = pd.to_numeric(bls_detailed['A_MEAN'], errors='coerce')

# Drop missing wage data
bls_cleaned = bls_detailed.dropna(subset=['A_MEAN']).copy()
bls_cleaned['OCC_CODE'] = bls_cleaned['OCC_CODE'].astype(str).str.strip()

# Clean AIOE Data
aioe_df['SOC Code'] = aioe_df['SOC Code'].astype(str).str.strip()

# Merge Datasets
merged_df = pd.merge(
    bls_cleaned, 
    aioe_df, 
    left_on='OCC_CODE', 
    right_on='SOC Code', 
    how='inner'
)

print(f"Merged Dataset contains {len(merged_df)} unique occupations.")
Merged Dataset contains 670 unique occupations.
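
An inner join drops unmatched rows silently, and here 774 AIOE rows yield only 670 matches. One way to see what was dropped is an outer join with pandas' `indicator=True` option, which adds a `_merge` column labelling each row as `both`, `left_only`, or `right_only`. The sketch below uses small made-up frames (`toy_bls`, `toy_aioe`, and every code and value in them are hypothetical) so it runs on its own; the same call could be pointed at `bls_cleaned` and `aioe_df`.

```python
import pandas as pd

# Hypothetical toy frames: codes and values are illustrative, not real data.
toy_bls = pd.DataFrame({
    "OCC_CODE": ["11-1011", "15-2051", "35-3023"],
    "A_MEAN": [200000.0, 100000.0, 30000.0],
})
toy_aioe = pd.DataFrame({
    "SOC Code": ["11-1011", "15-2051", "99-9999"],
    "AIOE": [1.3, 1.2, -0.5],
})

# An outer join keeps every row; `_merge` records where each row came from.
audit = pd.merge(
    toy_bls, toy_aioe,
    left_on="OCC_CODE", right_on="SOC Code",
    how="outer", indicator=True,
)

counts = audit["_merge"].value_counts()
bls_only = audit.loc[audit["_merge"] == "left_only", "OCC_CODE"].tolist()
aioe_only = audit.loc[audit["_merge"] == "right_only", "SOC Code"].tolist()

print(counts)
print("In BLS but not AIOE:", bls_only)
print("In AIOE but not BLS:", aioe_only)
```

Listing the unmatched codes makes it easier to judge whether the dropped occupations are a random subset or are concentrated in particular groups, which would matter for the correlation reported later.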

3. Feature Engineering: Decoding the Industry Codes

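The Standard Occupational Classification (SOC) system uses codes like 15-2051. The first two digits (15) identify the SOC major group, a broad family of related occupations. This notebook calls these groups “industries” for readability, although strictly speaking they classify occupations rather than employers’ industries.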

To make our charts readable for a general audience, I will extract these codes and map them to their actual industry names (e.g., mapping “15” to “Computer & Math”).

# Extract the first two digits
merged_df['Major Group Code'] = merged_df['OCC_CODE'].str[:2]

# Dictionary to map SOC codes to Human-Readable Industry Names
soc_mapping = {
    '11': 'Management', '13': 'Business & Financial', '15': 'Computer & Math',
    '17': 'Architecture & Engineering', '19': 'Life, Physical, Social Science',
    '21': 'Community & Social Service', '23': 'Legal', '25': 'Education & Library',
    '27': 'Arts, Design, Entertainment', '29': 'Healthcare Practitioners',
    '31': 'Healthcare Support', '33': 'Protective Service',
    '35': 'Food Preparation', '37': 'Building Cleaning',
    '39': 'Personal Care', '41': 'Sales', '43': 'Office & Admin',
    '45': 'Farming & Fishing', '47': 'Construction',
    '49': 'Installation & Repair', '51': 'Production', '53': 'Transportation'
}

# Map the codes to names
merged_df['Industry Name'] = merged_df['Major Group Code'].map(soc_mapping)

# Create Wage Tiers (Quartiles) for later analysis
merged_df['Wage Tier'] = pd.qcut(merged_df['A_MEAN'], q=4, labels=['Low Income', 'Medium Income', 'High Income', 'Top Earners'])

merged_df[['OCC_TITLE', 'Industry Name', 'A_MEAN', 'Wage Tier']].head()
   OCC_TITLE                            Industry Name  A_MEAN    Wage Tier
0  Chief Executives                     Management     258900.0  Top Earners
1  General and Operations Managers      Management     129330.0  Top Earners
2  Advertising and Promotions Managers  Management     152620.0  Top Earners
3  Marketing Managers                   Management     166410.0  Top Earners
4  Sales Managers                       Management     157610.0  Top Earners
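
To make the tiering step concrete, here is a minimal sketch of how `pd.qcut` assigns quartile labels, run on eight made-up occupations (the frame `toy` and its wage values are hypothetical). Because the quartiles are computed over occupations, each tier receives about a quarter of the rows regardless of how many people work in each job.

```python
import pandas as pd

# Hypothetical toy data: eight occupations with made-up mean wages.
toy = pd.DataFrame({
    "OCC_CODE": ["11-1011", "15-2051", "23-1011", "35-3023",
                 "47-2061", "53-3032", "29-1141", "43-4051"],
    "A_MEAN": [200000.0, 120000.0, 160000.0, 30000.0,
               45000.0, 55000.0, 90000.0, 42000.0],
})

# The first two characters of the SOC code give the major group.
toy["Major Group Code"] = toy["OCC_CODE"].str[:2]

# qcut places the bin edges at the 25th, 50th, and 75th percentiles.
tier_labels = ["Low Income", "Medium Income", "High Income", "Top Earners"]
toy["Wage Tier"] = pd.qcut(toy["A_MEAN"], q=4, labels=tier_labels)

tier_counts = toy["Wage Tier"].value_counts()
print(toy.sort_values("A_MEAN"))
print(tier_counts)
```

With eight distinct wages, each tier ends up with exactly two occupations. On real data the tiers can differ slightly in size when several occupations share a wage at a quartile boundary.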

4. Analysis Part I: The Big Picture

Statistical Correlation

First, we calculate the Pearson Correlation Coefficient. This statistic measures the strength of the linear relationship between two variables.

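  • A value of 1.0 means a perfect positive linear relationship.

  • A value of 0 means no linear relationship (a non-linear one may still exist).

  • A value of -1.0 means a perfect negative linear relationship.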
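
For readers who want to see what the coefficient actually computes, the sketch below works it out by hand: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the summed squared deviations. The two short series are made-up numbers, not values from the merged dataset, and the result is compared with pandas' built-in `Series.corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values standing in for AIOE scores and annual wages.
x = pd.Series([-1.0, -0.5, 0.0, 0.5, 1.0])
y = pd.Series([30000.0, 42000.0, 50000.0, 75000.0, 90000.0])

# Deviations of each observation from its variable's mean.
x_dev = x - x.mean()
y_dev = y - y.mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = x.corr(y)

print(f"manual: {r_manual:.6f}  pandas: {r_pandas:.6f}")
```

The two numbers agree up to floating-point rounding, which confirms that `corr` measures linear co-movement and nothing more.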

Visualization

The scatter plot below allows us to visualize every occupation.

  • X-Axis: AI Exposure (How much AI impacts the job).

  • Y-Axis: Annual Wage.

  • Color: Industry (Using our new readable names).

# Calculate Correlation
corr = merged_df['AIOE'].corr(merged_df['A_MEAN'])
print(f"Pearson Correlation Coefficient: {corr:.4f}")

# Prepare Trendline Data
x = merged_df['AIOE']
y = merged_df['A_MEAN']
m, b = np.polyfit(x, y, 1)

# Create Plotly Scatter
fig = px.scatter(
    merged_df,
    x='AIOE',
    y='A_MEAN',
    color='Industry Name',  
    hover_name='OCC_TITLE',
    hover_data={'Industry Name': False, 'AIOE': ':.2f', 'A_MEAN': ':$,.0f'},
    title=f'AI Exposure vs. Wages (Correlation: {corr:.2f})',
    labels={'AIOE': 'AI Exposure Score', 'A_MEAN': 'Annual Mean Wage'},
    height=700
)

# Add the trendline
fig.add_trace(
    go.Scatter(x=x, y=m*x + b, mode='lines', name='Trend Line', line=dict(color='black', dash='dash'))
)

fig.update_layout(legend=dict(yanchor="top", y=0.99, xanchor="left", x=1.02))
fig.show()
Pearson Correlation Coefficient: 0.4668
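
Wages are right-skewed, so a handful of very high-paying occupations can pull a Pearson coefficient around. As an optional robustness check (not part of the original analysis), the sketch below compares Pearson on raw wages, Pearson on log wages, and a Spearman rank correlation computed as the Pearson correlation of the ranks. The `toy` frame is made up; on the real data the same three lines could be applied to `merged_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one extreme wage to mimic right skew.
toy = pd.DataFrame({
    "AIOE": [-1.2, -0.6, 0.1, 0.7, 1.1, 1.4],
    "A_MEAN": [30000.0, 38000.0, 52000.0, 70000.0, 95000.0, 250000.0],
})

# Pearson on raw wages: sensitive to the extreme value.
pearson_raw = toy["AIOE"].corr(toy["A_MEAN"])

# Pearson on log wages: compresses the long right tail.
pearson_log = toy["AIOE"].corr(np.log(toy["A_MEAN"]))

# Spearman: Pearson correlation of the ranks, so only the ordering matters.
spearman = toy["AIOE"].rank().corr(toy["A_MEAN"].rank())

print(f"Pearson (raw wage): {pearson_raw:.3f}")
print(f"Pearson (log wage): {pearson_log:.3f}")
print(f"Spearman (ranks):   {spearman:.3f}")
```

If the three coefficients tell the same story on the real data, the positive relationship is unlikely to be an artifact of a few outliers.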

5. Analysis Part II: Which Industries are Most Exposed?

The scatter plot is dense. To see the “forest for the trees,” let’s aggregate the data by industry.

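We will calculate the average AI Exposure for each industry and sort them. This will show which sectors are least and most exposed. Low exposure does not guarantee that a sector is “AI-Safe,” and high exposure does not by itself mean jobs are at risk; the score only indicates how much AI capabilities overlap with the work.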

# Group by Industry and calculate means
industry_stats = merged_df.groupby('Industry Name')[['AIOE', 'A_MEAN']].mean().reset_index()

# Sort by AI Exposure
industry_stats = industry_stats.sort_values(by='AIOE', ascending=True)

# Create Bar Chart
fig_bar = px.bar(
    industry_stats,
    x='AIOE',
    y='Industry Name',
    orientation='h', 
    color='A_MEAN',  
    color_continuous_scale='Viridis',
    title='Average AI Exposure by Industry, Colored by Average Wage',
    labels={'AIOE': 'Average AI Exposure', 'Industry Name': '', 'A_MEAN': 'Avg Wage'},
    height=600
)

fig_bar.show()
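
The group means above give every occupation equal weight, so a small occupation counts as much as one employing millions. A possible refinement is to weight each occupation by its employment. The sketch below assumes an employment column named `TOT_EMP` (the BLS file has a total-employment field, but it is not loaded or cleaned anywhere in this notebook, so treat the name and its numeric format as assumptions); the `toy` values are made up.

```python
import pandas as pd

# Hypothetical toy data; TOT_EMP is an assumed employment-count column.
toy = pd.DataFrame({
    "Industry Name": ["Legal", "Legal", "Construction", "Construction"],
    "AIOE": [1.4, 1.0, -1.0, -0.6],
    "TOT_EMP": [100.0, 300.0, 500.0, 500.0],
})

# Unweighted mean: every occupation counts equally.
simple = toy.groupby("Industry Name")["AIOE"].mean()

# Weighted mean: sum(AIOE * employment) / sum(employment) within each group.
toy["weighted_aioe"] = toy["AIOE"] * toy["TOT_EMP"]
sums = toy.groupby("Industry Name")[["weighted_aioe", "TOT_EMP"]].sum()
weighted = sums["weighted_aioe"] / sums["TOT_EMP"]

comparison = pd.DataFrame({"simple": simple, "weighted": weighted})
print(comparison)
```

In the toy data the Legal average falls from 1.2 to 1.1 once the larger, less exposed occupation is given more weight, while Construction is unchanged because its two occupations are the same size.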

6. Analysis Part III: Are Higher-Paid Occupations More Exposed to AI?

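Finally, let’s split the occupations into four wage quartiles: Low Income, Medium Income, High Income, and Top Earners. Each tier holds roughly a quarter of the occupations in the merged data, not a quarter of all workers.

The box plot below shows the distribution of AI Exposure scores for each wage tier. If the boxes shift upward as we move to higher tiers, higher-paid occupations tend to be more exposed to AI. Exposure is not the same as use or benefit, so this pattern alone does not confirm that AI is a tool for the wealthy.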

# Define order for the plot
tier_order = ['Low Income', 'Medium Income', 'High Income', 'Top Earners']

fig_box = px.box(
    merged_df,
    x='Wage Tier',
    y='AIOE',
    color='Wage Tier',
    category_orders={'Wage Tier': tier_order},
    title='Distribution of AI Exposure Across Wage Tiers',
    labels={'AIOE': 'AI Exposure Score'},
    points='outliers' 
)

fig_box.show()
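
A box plot is read by eye, so it can help to print the numbers behind it. The sketch below is one way to tabulate the count, median, and quartiles of AI exposure per tier with `groupby` and `agg`. The `toy` frame and its scores are made up; on the real data the same aggregation could be run on `merged_df`.

```python
import pandas as pd

# Hypothetical toy data: exposure scores for two of the four tiers.
toy = pd.DataFrame({
    "Wage Tier": ["Low Income", "Low Income",
                  "Top Earners", "Top Earners", "Top Earners"],
    "AIOE": [-1.0, -0.4, 0.8, 1.2, 1.0],
})

# Median plus 25th and 75th percentiles: the values a box plot draws.
summary = toy.groupby("Wage Tier")["AIOE"].agg(
    count="count",
    q1=lambda s: s.quantile(0.25),
    median="median",
    q3=lambda s: s.quantile(0.75),
)

median_gap = summary.loc["Top Earners", "median"] - summary.loc["Low Income", "median"]
print(summary)
print(f"Median gap (Top Earners - Low Income): {median_gap:.2f}")
```

Reporting the median gap alongside the chart gives readers a concrete figure to check the visual impression against.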

7. Conclusion and Takeaways

Through this multi-layered analysis, we have found a consistent pattern in how AI exposure is distributed across the U.S. labor market.

Key Findings:

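  1. The Correlation (0.47): There is a moderate positive link between wages and AI exposure. A coefficient of 0.47 means a linear fit accounts for roughly 22% of the variation (0.47² ≈ 0.22), so most of the variation in wages is unrelated to exposure. The trendline in our scatter plot shows that as AI exposure goes up, wages typically increase.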

  2. Industry Divide: As shown in the bar chart, “Blue-collar” sectors like Construction, Food Prep, and Transportation have the lowest AI exposure. Conversely, Computer & Math and Legal professions, which are also high-paying, are at the top.

  3. Income Tier Reality: The box plot illustrates that “Top Earners” have a higher median AI exposure than “Low Income” workers. This is a visual comparison; no formal significance test was run.

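Final Thought: The data suggests that AI exposure is currently concentrated in high-wage, white-collar occupations rather than in low-wage work. The AIOE score measures exposure, not whether AI replaces or complements workers, and a correlation does not establish causation. The pattern is therefore consistent with, but does not prove, the view that AI is less an “automation apocalypse” for low-wage workers than a productivity shock for the white-collar elite. If that view holds, the challenge for policymakers is not just protecting jobs, but ensuring that the efficiency gains from AI in high-paying sectors are shared broadly, rather than leading to further wage polarization.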