Data Source Integration Guide
This guide provides instructions for integrating the new data sources into the Digital/AI Jobs Dashboard.
New Data Sources
1. Anthropic EconomicIndex Dataset
Installation:
pip install datasets
Integration Code:
from datasets import load_dataset
import pandas as pd
# Load the dataset
dataset = load_dataset("Anthropic/EconomicIndex", "release_2025_09_15")
# Convert to pandas DataFrame
df = dataset['train'].to_pandas()
# Process and integrate into DuckDB
# (Add to load_data.py)
Data Structure:
- Economic indicators related to AI adoption
- Time series data
- Country-level metrics
Integration Steps:
1. Add datasets to requirements.txt
2. Create function load_anthropic_economic_index() in load_data.py
3. Process and merge with the existing digital_jobs table (see the sketch below)
4. Create new views for economic indicators
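A minimal sketch of steps 3 and 4, assuming the Anthropic data and the existing digital_jobs table share country_code and year columns and that economic_index is the metric column (all placeholders to verify against the real schemas; the database path is illustrative):
import duckdb

conn = duckdb.connect("data/dashboard.duckdb")  # illustrative path
conn.execute("""
    CREATE VIEW digital_jobs_with_economic_index AS
    SELECT j.*, a.economic_index
    FROM digital_jobs j
    LEFT JOIN anthropic_economic_index a
        ON j.country_code = a.country_code AND j.year = a.year
""")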
2. Stanford AI Index Report 2025
Access Method:
- Public data available via the AI Index Public Data Portal
- Download datasets or use the API if available
Key Data Points:
- Private AI investment by country ($109.1B US, $9.3B China, $4.5B UK in 2024)
- AI adoption rates (78% of organizations using AI in 2024)
- Model development statistics
- Performance benchmarks
Integration Steps:
1. Download or access public data from Stanford HAI
2. Extract relevant metrics (investment, adoption, job impacts)
3. Create function load_stanford_ai_index() in load_data.py
4. Map to country codes and integrate with existing data
Example Integration:
def load_stanford_ai_index():
    """Load Stanford AI Index data."""
    # Download or access data
    # Process investment data by country
    # Extract adoption and job impact metrics
    # Return DataFrame compatible with existing schema
    pass
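For instance, if the portal tables are downloaded as CSV, the loader might look like this (a sketch only; the file path and column names are assumptions to be checked against the actual download):
import pandas as pd

def load_stanford_ai_index():
    """Load Stanford AI Index data from a downloaded CSV."""
    df = pd.read_csv("data/stanford_ai_index_2025.csv")  # illustrative path
    # Placeholder column names; adjust to the real file's layout
    df = df.rename(columns={"Country": "country_name",
                            "Private investment": "ai_investment"})
    return df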
3. PwC AI Jobs Barometer
Access Method:
- Download PDF reports from PwC AI Jobs Barometer
- Extract data from tables and charts
- Use PDF parsing libraries or manual data extraction
Key Data Points:
- AI job posting trends by industry
- Skills demand analysis
- Regional job market variations
- Supply and demand dynamics
Integration Steps:
1. Download 2025 and 2024 reports
2. Extract structured data (consider using pdfplumber or tabula-py)
3. Create function load_pwc_ai_jobs_barometer() in load_data.py
4. Map to existing industry and skill taxonomies
Example Integration:
import pdfplumber
def load_pwc_ai_jobs_barometer():
    """Extract data from PwC AI Jobs Barometer PDFs."""
    # Read PDF files
    # Extract tables and charts
    # Process into structured format
    # Return DataFrame
    pass
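A minimal extraction sketch using pdfplumber (the file path is illustrative, and tables pulled from real reports usually need manual cleanup):
import pandas as pd
import pdfplumber

def extract_pwc_tables(pdf_path):
    """Pull every detectable table from one report PDF."""
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                # Treat the first row as the header; real tables may differ
                tables.append(pd.DataFrame(table[1:], columns=table[0]))
    return tables

tables = extract_pwc_tables("data/pwc_ai_jobs_barometer_2025.pdf")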
Dependencies:
pip install pdfplumber tabula-py
4. Yale Budget Lab - AI Labor Market Impact
Access Method:
- Access research publications
- Extract datasets from research papers
- Contact researchers for data access
Key Data Points:
- Labor market displacement metrics
- Job creation statistics
- Workforce transition data
- Economic impact assessments
Integration Steps:
1. Review research publications
2. Extract quantitative data
3. Create function load_yale_budget_lab() in load_data.py (see the sketch below)
4. Integrate with labor market indicators
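Because the Yale data comes from publications rather than a feed, one option is a manually curated loader (a sketch; the metric names are placeholders, and the values must be transcribed from the papers during step 2):
import pandas as pd

def load_yale_budget_lab():
    """Load manually transcribed Yale Budget Lab metrics."""
    records = [
        # Placeholder rows; fill in values transcribed from the papers
        {"country_code": "USA", "metric": "displacement_rate", "value": None},
        {"country_code": "USA", "metric": "jobs_created", "value": None},
    ]
    return pd.DataFrame(records)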
5. McKinsey - Economic Potential of Generative AI
Access Method:
- Access research reports and articles
- Extract data from visualizations and tables
- Use web scraping for structured data (with permission)
Key Data Points:
- Economic value potential by industry
- Productivity gains metrics
- Workforce transformation data
- Skills requirements
Integration Steps:
1. Review McKinsey reports
2. Extract structured data
3. Create function load_mckinsey_generative_ai() in load_data.py
4. Map to industry and skill categories (see the sketch below)
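For step 4, a lookup table can translate the reports' industry labels into the dashboard taxonomy; the left-hand labels below are hypothetical until the reports are reviewed:
MCKINSEY_INDUSTRY_MAP = {
    "High tech": "Information Technology",      # hypothetical label
    "Banking": "Financial Services",            # hypothetical label
    "Advanced manufacturing": "Manufacturing",  # hypothetical label
}

def map_mckinsey_industries(df):
    """Normalize industry labels, leaving unmapped values unchanged."""
    df["industry"] = df["industry"].map(MCKINSEY_INDUSTRY_MAP).fillna(df["industry"])
    return df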
Integration Workflow
Step 1: Update Requirements
Add new dependencies to requirements.txt:
datasets>=2.14.0
pdfplumber>=0.9.0
tabula-py>=2.5.0
Step 2: Create Loader Functions
Add functions to load_data.py:
def load_anthropic_economic_index():
    """Load Anthropic EconomicIndex dataset."""
    # Implementation
    pass

def load_stanford_ai_index():
    """Load Stanford AI Index data."""
    # Implementation
    pass

# ... etc
Step 3: Update Database Schema
Add new tables to DuckDB:
# In create_database() function
conn.execute("""
    CREATE TABLE anthropic_economic_index AS
    SELECT * FROM anthropic_df
""")
conn.execute("""
    CREATE TABLE stanford_ai_index AS
    SELECT * FROM stanford_df
""")
Step 4: Create Aggregated Views
Create views that combine new data with existing data:
conn.execute("""
    CREATE VIEW enhanced_country_trends AS
    SELECT
        c.*,
        a.economic_index,
        s.ai_investment,
        s.ai_adoption_rate
    FROM country_trends c
    LEFT JOIN anthropic_economic_index a ON c.country_code = a.country_code
    LEFT JOIN stanford_ai_index s ON c.country_code = s.country_code
""")
Step 5: Update Dashboard
Add new visualizations and metrics to app.py (see the sketch after this list):
- New charts showing AI investment trends
- Economic index overlays
- Enhanced country comparisons
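As one possibility, assuming app.py builds charts with Plotly Express (adapt this to whatever charting library the dashboard actually uses; the column names are placeholders):
import plotly.express as px

def ai_investment_chart(conn):
    """Bar chart of private AI investment from the new Stanford table."""
    df = conn.execute(
        "SELECT country_code, ai_investment FROM stanford_ai_index"
    ).df()
    return px.bar(df, x="country_code", y="ai_investment",
                  title="Private AI Investment by Country")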
Data Mapping
Country Code Mapping
Ensure consistent country codes across all sources (a normalization sketch follows this list):
- World Bank: ISO 3-letter codes (USA, CHN, IND, etc.)
- Stanford AI Index: may use different codes; map to ISO
- PwC: may use country names; map to ISO codes
- Anthropic: check dataset documentation for code format
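One way to normalize country names is the pycountry package (an assumed extra dependency, not currently in requirements.txt):
import pycountry

def to_iso3(name):
    """Resolve a country name to its ISO 3166-1 alpha-3 code."""
    try:
        return pycountry.countries.lookup(name).alpha_3
    except LookupError:
        return None  # flag the row for manual review

print(to_iso3("United States"))  # USA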
Industry Mapping
Map industry categories to existing taxonomy:
- Information Technology
- Financial Services
- Manufacturing
- Healthcare
- Education
- Retail
- Telecommunications
- Professional Services
Skill Type Mapping
Map skill categories:
- AI/ML Engineering
- Data Science
- Software Development
- Cybersecurity
- Cloud Computing
- Digital Marketing
- Data Analytics
- IT Support
Example: Integrating Anthropic EconomicIndex
# In load_data.py
from datasets import load_dataset
def load_anthropic_economic_index():
    """Load Anthropic EconomicIndex dataset from Hugging Face."""
    print("Loading Anthropic EconomicIndex dataset...")
    try:
        # Load dataset
        dataset = load_dataset("Anthropic/EconomicIndex", "release_2025_09_15")
        # Convert to pandas
        df = dataset['train'].to_pandas()
        # Process and clean data
        # Map country codes if needed
        # Align with existing schema
        print(f" Loaded {len(df)} records from Anthropic EconomicIndex")
        return df
    except Exception as e:
        print(f" Error loading Anthropic EconomicIndex: {e}")
        return None

# In create_database() function:
anthropic_df = load_anthropic_economic_index()
if anthropic_df is not None:
    conn.execute("CREATE TABLE anthropic_economic_index AS SELECT * FROM anthropic_df")
Data Quality Checks
Before integrating new sources, work through the checklist below; a minimal validation sketch follows it.
- Data Validation:
  - Check for missing values
  - Validate country codes
  - Verify date ranges
  - Check for duplicates
- Schema Alignment:
  - Ensure consistent column names
  - Align data types
  - Map categorical variables
- Data Completeness:
  - Check coverage by country
  - Verify time series continuity
  - Identify gaps
- Integration Testing:
  - Test joins with existing tables
  - Verify aggregated views
  - Check dashboard performance
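A minimal sketch covering several of these checks, assuming each source DataFrame carries a country_code column:
def validate_source(df, required_cols, valid_iso_codes):
    """Return a list of data-quality issues found in a new source."""
    issues = []
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        issues.append(f"missing columns: {missing}")
    if df.isnull().any().any():
        issues.append("null values present")
    if df.duplicated().any():
        issues.append("duplicate rows found")
    unknown = set(df["country_code"]) - set(valid_iso_codes)
    if unknown:
        issues.append(f"unknown country codes: {sorted(unknown)}")
    return issues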
Next Steps
- Priority 1: Integrate Anthropic EconomicIndex (easiest - direct dataset access)
- Priority 2: Extract and integrate Stanford AI Index public data
- Priority 3: Process PwC AI Jobs Barometer reports
- Priority 4: Integrate Yale and McKinsey research data
Support
For questions about integration:
- Check source documentation
- Review dataset schemas
- Test with sample data first
- Validate against existing data
MCP Server Integration
A Model Context Protocol (MCP) server has been created to provide programmatic access to all data sources.
Using the MCP Server
The MCP server is located in the mcp_server/ directory and provides tools to fetch data from all sources.
Installation:
cd mcp_server
pip install -r requirements.txt
Available Tools:
- get_anthropic_economic_index - Fetch Anthropic EconomicIndex dataset
- get_stanford_ai_index - Get Stanford AI Index metrics
- get_world_bank_indicator - Fetch World Bank indicator data
- get_itu_ict_data - Get ITU ICT data
- get_pwc_ai_jobs_data - Get PwC AI Jobs Barometer info
- get_yale_budget_lab_info - Get Yale Budget Lab info
- get_mckinsey_generative_ai_info - Get McKinsey info
- list_available_data_sources - List all sources
Example Usage:
# See mcp_server/client_example.py for full examples
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client
# Use the MCP server to fetch data
# See mcp_server/integrate_with_dashboard.py for integration examples
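A runnable version of this pattern (a sketch; the server entry point mcp_server/server.py is an assumption, so check mcp_server/README.md for the actual launch command):
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Assumed entry point; see mcp_server/README.md
    params = StdioServerParameters(command="python", args=["mcp_server/server.py"])
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool("list_available_data_sources", {})
            print(result.content)

asyncio.run(main())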
Integration with Dashboard:
# Run integration script
python mcp_server/integrate_with_dashboard.py
For more details, see mcp_server/README.md.
Last Updated
January 2025