Code
import requests
import duckdb
import altair as alt
from pathlib import PathThis notebook downloads World Bank labor force participation data and creates a simple visualization showing trends by region over time.
What you’ll learn: - How to download data from the World Bank - How to use DuckDB for fast CSV loading and SQL analysis - How to create charts with Altair
Required packages:
pip install requests duckdb pandas altair
We’ll download two files from the World Bank: 1. Indicator data - The actual labor force participation rates 2. Dictionary data - Metadata about countries and regions
# URLs for World Bank data
INDICATOR_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS.csv"
DICTIONARY_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS_DATADICT.csv"
# Local file paths
INDICATOR_FILE = DATA_DIR / "labor_force_data.csv"
DICTIONARY_FILE = DATA_DIR / "data_dictionary.csv"# Download indicator data (skip if already exists)
if not INDICATOR_FILE.exists():
print("Downloading indicator data...")
response = requests.get(INDICATOR_URL, timeout=60)
response.raise_for_status()
INDICATOR_FILE.write_bytes(response.content)
print(f"Saved to {INDICATOR_FILE}")
else:
print(f"Using cached file: {INDICATOR_FILE}")Downloading indicator data...
Saved to data/labor_force_data.csv
# Download dictionary data (skip if already exists)
if not DICTIONARY_FILE.exists():
print("Downloading dictionary data...")
response = requests.get(DICTIONARY_URL, timeout=60)
response.raise_for_status()
DICTIONARY_FILE.write_bytes(response.content)
print(f"Saved to {DICTIONARY_FILE}")
else:
print(f"Using cached file: {DICTIONARY_FILE}")Downloading dictionary data...
Saved to data/data_dictionary.csv
DuckDB can read CSV files directly - no need to load into pandas first! This is faster and more memory efficient for large files.
Connected to DuckDB: data/worldbank.duckdb
Loaded indicator_raw: 8,216 rows
Loaded dictionary: 42 rows
Let’s take a quick look at what we loaded using DuckDB queries.
| STRUCTURE | STRUCTURE_ID | ACTION | FREQ | FREQ_LABEL | REF_AREA | REF_AREA_LABEL | INDICATOR | INDICATOR_LABEL | SEX | ... | DATA_SOURCE_LABEL | UNIT_TYPE | UNIT_TYPE_LABEL | TIME_FORMAT | TIME_FORMAT_LABEL | COMMENT_OBS | OBS_STATUS | OBS_STATUS_LABEL | OBS_CONF | OBS_CONF_LABEL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | datastructure | WB.DATA360:DS_DATA360(1.3) | I | A | Annual | AFE | Africa Eastern and Southern | WB_WDI_SL_TLF_CACT_ZS | Labor force participation rate, total (% of to... | _T | ... | World Development Indicators (WDI) | RATIO | Ratio | P1Y | Annual | None | A | Normal value | PU | Public |
| 1 | datastructure | WB.DATA360:DS_DATA360(1.3) | I | A | Annual | AFW | Africa Western and Central | WB_WDI_SL_TLF_CACT_ZS | Labor force participation rate, total (% of to... | _T | ... | World Development Indicators (WDI) | RATIO | Ratio | P1Y | Annual | None | A | Normal value | PU | Public |
| 2 | datastructure | WB.DATA360:DS_DATA360(1.3) | I | A | Annual | ARB | Arab World | WB_WDI_SL_TLF_CACT_ZS | Labor force participation rate, total (% of to... | _T | ... | World Development Indicators (WDI) | RATIO | Ratio | P1Y | Annual | None | A | Normal value | PU | Public |
| 3 | datastructure | WB.DATA360:DS_DATA360(1.3) | I | A | Annual | CSS | Caribbean small states | WB_WDI_SL_TLF_CACT_ZS | Labor force participation rate, total (% of to... | _T | ... | World Development Indicators (WDI) | RATIO | Ratio | P1Y | Annual | None | A | Normal value | PU | Public |
| 4 | datastructure | WB.DATA360:DS_DATA360(1.3) | I | A | Annual | CEB | Central Electricity Board (CEB) | WB_WDI_SL_TLF_CACT_ZS | Labor force participation rate, total (% of to... | _T | ... | World Development Indicators (WDI) | RATIO | Ratio | P1Y | Annual | None | A | Normal value | PU | Public |
5 rows × 45 columns
| column_name | column_type | null | key | default | extra | |
|---|---|---|---|---|---|---|
| 0 | STRUCTURE | VARCHAR | YES | None | None | None |
| 1 | STRUCTURE_ID | VARCHAR | YES | None | None | None |
| 2 | ACTION | VARCHAR | YES | None | None | None |
| 3 | FREQ | VARCHAR | YES | None | None | None |
| 4 | FREQ_LABEL | VARCHAR | YES | None | None | None |
| 5 | REF_AREA | VARCHAR | YES | None | None | None |
| 6 | REF_AREA_LABEL | VARCHAR | YES | None | None | None |
| 7 | INDICATOR | VARCHAR | YES | None | None | None |
| 8 | INDICATOR_LABEL | VARCHAR | YES | None | None | None |
| 9 | SEX | VARCHAR | YES | None | None | None |
| 10 | SEX_LABEL | VARCHAR | YES | None | None | None |
| 11 | AGE | VARCHAR | YES | None | None | None |
| 12 | AGE_LABEL | VARCHAR | YES | None | None | None |
| 13 | URBANISATION | VARCHAR | YES | None | None | None |
| 14 | URBANISATION_LABEL | VARCHAR | YES | None | None | None |
| 15 | UNIT_MEASURE | VARCHAR | YES | None | None | None |
| 16 | UNIT_MEASURE_LABEL | VARCHAR | YES | None | None | None |
| 17 | COMP_BREAKDOWN_1 | VARCHAR | YES | None | None | None |
| 18 | COMP_BREAKDOWN_1_LABEL | VARCHAR | YES | None | None | None |
| 19 | COMP_BREAKDOWN_2 | VARCHAR | YES | None | None | None |
| 20 | COMP_BREAKDOWN_2_LABEL | VARCHAR | YES | None | None | None |
| 21 | COMP_BREAKDOWN_3 | VARCHAR | YES | None | None | None |
| 22 | COMP_BREAKDOWN_3_LABEL | VARCHAR | YES | None | None | None |
| 23 | TIME_PERIOD | BIGINT | YES | None | None | None |
| 24 | OBS_VALUE | DOUBLE | YES | None | None | None |
| 25 | AGG_METHOD | VARCHAR | YES | None | None | None |
| 26 | AGG_METHOD_LABEL | VARCHAR | YES | None | None | None |
| 27 | DECIMALS | BIGINT | YES | None | None | None |
| 28 | DECIMALS_LABEL | VARCHAR | YES | None | None | None |
| 29 | DATABASE_ID | VARCHAR | YES | None | None | None |
| 30 | DATABASE_ID_LABEL | VARCHAR | YES | None | None | None |
| 31 | COMMENT_TS | VARCHAR | YES | None | None | None |
| 32 | UNIT_MULT | BIGINT | YES | None | None | None |
| 33 | UNIT_MULT_LABEL | VARCHAR | YES | None | None | None |
| 34 | DATA_SOURCE | VARCHAR | YES | None | None | None |
| 35 | DATA_SOURCE_LABEL | VARCHAR | YES | None | None | None |
| 36 | UNIT_TYPE | VARCHAR | YES | None | None | None |
| 37 | UNIT_TYPE_LABEL | VARCHAR | YES | None | None | None |
| 38 | TIME_FORMAT | VARCHAR | YES | None | None | None |
| 39 | TIME_FORMAT_LABEL | VARCHAR | YES | None | None | None |
| 40 | COMMENT_OBS | VARCHAR | YES | None | None | None |
| 41 | OBS_STATUS | VARCHAR | YES | None | None | None |
| 42 | OBS_STATUS_LABEL | VARCHAR | YES | None | None | None |
| 43 | OBS_CONF | VARCHAR | YES | None | None | None |
| 44 | OBS_CONF_LABEL | VARCHAR | YES | None | None | None |
| VARIABLE_ID | VARIABLE_LABEL | VARIABLE_DESCRIPTION | VARIABLE_DATA_TYPE | VARIABLE_REQUIRED | |
|---|---|---|---|---|---|
| 0 | FREQ_LABEL | Frequency of observation label | Time interval at which observations occur over... | String | True |
| 1 | URBANISATION_LABEL | Degree of urbanisation label | Refers to Total, Urban, or Rural location. | String | True |
| 2 | COMP_BREAKDOWN_1_LABEL | Custom Dimension 1 label | None | String | True |
| 3 | COMP_BREAKDOWN_3_LABEL | Custom Dimension 3 label | None | String | True |
| 4 | AGG_METHOD_LABEL | Aggregation method label | None | String | False |
| column_name | column_type | null | key | default | extra | |
|---|---|---|---|---|---|---|
| 0 | VARIABLE_ID | VARCHAR | YES | None | None | None |
| 1 | VARIABLE_LABEL | VARCHAR | YES | None | None | None |
| 2 | VARIABLE_DESCRIPTION | VARCHAR | YES | None | None | None |
| 3 | VARIABLE_DATA_TYPE | VARCHAR | YES | None | None | None |
| 4 | VARIABLE_REQUIRED | BOOLEAN | YES | None | None | None |
Create a clean, analysis-ready table with proper data types and region info.
# Create a cleaned indicator table
conn.execute("""
CREATE OR REPLACE TABLE indicator_clean AS
SELECT
REF_AREA AS country_code,
REF_AREA_LABEL AS country_name,
CAST(TIME_PERIOD AS INTEGER) AS year,
CAST(OBS_VALUE AS DOUBLE) AS value,
INDICATOR_LABEL AS indicator_name
FROM indicator_raw
WHERE OBS_VALUE IS NOT NULL
AND TIME_PERIOD IS NOT NULL
""")
row_count = conn.execute("SELECT COUNT(*) FROM indicator_clean").fetchone()[0]
print(f"Cleaned data: {row_count:,} rows")Cleaned data: 8,216 rows
| country_code | country_name | year | value | indicator_name | |
|---|---|---|---|---|---|
| 0 | AFE | Africa Eastern and Southern | 1990 | 72.111549 | Labor force participation rate, total (% of to... |
| 1 | AFW | Africa Western and Central | 1990 | 76.593802 | Labor force participation rate, total (% of to... |
| 2 | ARB | Arab World | 1990 | 47.597590 | Labor force participation rate, total (% of to... |
| 3 | CSS | Caribbean small states | 1990 | 59.731386 | Labor force participation rate, total (% of to... |
| 4 | CEB | Central Electricity Board (CEB) | 1990 | 57.682804 | Labor force participation rate, total (% of to... |
Dictionary table columns:
column_name column_type null key default extra
0 VARIABLE_ID VARCHAR YES None None None
1 VARIABLE_LABEL VARCHAR YES None None None
2 VARIABLE_DESCRIPTION VARCHAR YES None None None
3 VARIABLE_DATA_TYPE VARCHAR YES None None None
4 VARIABLE_REQUIRED BOOLEAN YES None None None
# The World Bank data dictionary file describes variables, not country metadata.
# We'll create a simple region mapping based on country codes.
# For a real project, you'd download a proper country-region mapping file.
# Create a region mapping table based on common World Bank region codes
conn.execute("""
CREATE OR REPLACE TABLE region_mapping AS
SELECT * FROM (VALUES
-- East Asia & Pacific
('CHN', 'East Asia & Pacific'),
('JPN', 'East Asia & Pacific'),
('KOR', 'East Asia & Pacific'),
('AUS', 'East Asia & Pacific'),
('IDN', 'East Asia & Pacific'),
('THA', 'East Asia & Pacific'),
('VNM', 'East Asia & Pacific'),
('MYS', 'East Asia & Pacific'),
('PHL', 'East Asia & Pacific'),
('NZL', 'East Asia & Pacific'),
-- Europe & Central Asia
('DEU', 'Europe & Central Asia'),
('FRA', 'Europe & Central Asia'),
('GBR', 'Europe & Central Asia'),
('ITA', 'Europe & Central Asia'),
('ESP', 'Europe & Central Asia'),
('POL', 'Europe & Central Asia'),
('NLD', 'Europe & Central Asia'),
('TUR', 'Europe & Central Asia'),
('RUS', 'Europe & Central Asia'),
('UKR', 'Europe & Central Asia'),
-- Latin America & Caribbean
('BRA', 'Latin America & Caribbean'),
('MEX', 'Latin America & Caribbean'),
('ARG', 'Latin America & Caribbean'),
('COL', 'Latin America & Caribbean'),
('CHL', 'Latin America & Caribbean'),
('PER', 'Latin America & Caribbean'),
('VEN', 'Latin America & Caribbean'),
-- Middle East & North Africa
('EGY', 'Middle East & North Africa'),
('SAU', 'Middle East & North Africa'),
('IRN', 'Middle East & North Africa'),
('IRQ', 'Middle East & North Africa'),
('MAR', 'Middle East & North Africa'),
('DZA', 'Middle East & North Africa'),
-- North America
('USA', 'North America'),
('CAN', 'North America'),
-- South Asia
('IND', 'South Asia'),
('PAK', 'South Asia'),
('BGD', 'South Asia'),
('LKA', 'South Asia'),
('NPL', 'South Asia'),
-- Sub-Saharan Africa
('NGA', 'Sub-Saharan Africa'),
('ZAF', 'Sub-Saharan Africa'),
('KEN', 'Sub-Saharan Africa'),
('ETH', 'Sub-Saharan Africa'),
('GHA', 'Sub-Saharan Africa'),
('TZA', 'Sub-Saharan Africa')
) AS t(country_code, region)
""")
print("Created region_mapping table")Created region_mapping table
# Join with region mapping to add region information
conn.execute("""
CREATE OR REPLACE TABLE indicator_with_region AS
SELECT
i.*,
COALESCE(r.region, 'Other') AS region
FROM indicator_clean i
LEFT JOIN region_mapping r ON UPPER(i.country_code) = r.country_code
""")
print("Created table: indicator_with_region")Created table: indicator_with_region
| country_code | country_name | year | value | indicator_name | region | |
|---|---|---|---|---|---|---|
| 0 | DZA | Algeria | 1990 | 45.587 | Labor force participation rate, total (% of to... | Middle East & North Africa |
| 1 | ARG | Argentina | 1990 | 59.033 | Labor force participation rate, total (% of to... | Latin America & Caribbean |
| 2 | AUS | Australia | 1990 | 63.894 | Labor force participation rate, total (% of to... | East Asia & Pacific |
| 3 | BGD | Bangladesh | 1990 | 56.638 | Labor force participation rate, total (% of to... | South Asia |
| 4 | BRA | Brazil | 1990 | 64.682 | Labor force participation rate, total (% of to... | Latin America & Caribbean |
| 5 | CAN | Canada | 1990 | 66.760 | Labor force participation rate, total (% of to... | North America |
| 6 | CHL | Chile | 1990 | 52.415 | Labor force participation rate, total (% of to... | Latin America & Caribbean |
| 7 | CHN | China | 1990 | 79.170 | Labor force participation rate, total (% of to... | East Asia & Pacific |
| 8 | COL | Colombia | 1990 | 66.705 | Labor force participation rate, total (% of to... | Latin America & Caribbean |
| 9 | EGY | Egypt, Arab Rep. | 1990 | 46.619 | Labor force participation rate, total (% of to... | Middle East & North Africa |
Calculate average labor force participation rate by region and year.
# Query regional averages and return as pandas DataFrame for charting
regional_df = conn.execute("""
SELECT
region,
year,
AVG(value) AS avg_participation_rate,
COUNT(DISTINCT country_code) AS num_countries
FROM indicator_with_region
WHERE region != 'Unknown'
AND region IS NOT NULL
AND region != ''
GROUP BY region, year
HAVING COUNT(DISTINCT country_code) >= 3
ORDER BY region, year
""").df()
print(f"Regional aggregates: {len(regional_df)} rows")
regional_df.head(10)Regional aggregates: 245 rows
| region | year | avg_participation_rate | num_countries | |
|---|---|---|---|---|
| 0 | East Asia & Pacific | 1990 | 67.3539 | 10 |
| 1 | East Asia & Pacific | 1991 | 67.2899 | 10 |
| 2 | East Asia & Pacific | 1992 | 67.2213 | 10 |
| 3 | East Asia & Pacific | 1993 | 67.0079 | 10 |
| 4 | East Asia & Pacific | 1994 | 67.2205 | 10 |
| 5 | East Asia & Pacific | 1995 | 67.3145 | 10 |
| 6 | East Asia & Pacific | 1996 | 67.3737 | 10 |
| 7 | East Asia & Pacific | 1997 | 67.1655 | 10 |
| 8 | East Asia & Pacific | 1998 | 66.7576 | 10 |
| 9 | East Asia & Pacific | 1999 | 66.4673 | 10 |
Create a line chart showing labor force participation trends by region.
DataTransformerRegistry.enable('default')
# Create the chart
chart = alt.Chart(regional_df).mark_line(point=True).encode(
x=alt.X('year:Q', title='Year'),
y=alt.Y('avg_participation_rate:Q', title='Average Participation Rate (%)'),
color=alt.Color('region:N', title='Region'),
tooltip=[
alt.Tooltip('region:N', title='Region'),
alt.Tooltip('year:Q', title='Year'),
alt.Tooltip('avg_participation_rate:Q', title='Avg Rate', format='.1f'),
alt.Tooltip('num_countries:Q', title='Countries')
]
).properties(
width=700,
height=400,
title='Labor Force Participation Rate by Region Over Time'
)
chart# Get summary stats (DuckDB query → pandas DataFrame)
summary_df = conn.execute("""
SELECT
region,
MIN(year) AS first_year,
MAX(year) AS last_year,
ROUND(AVG(value), 1) AS avg_rate,
COUNT(*) AS data_points
FROM indicator_with_region
WHERE region != 'Unknown' AND region IS NOT NULL AND region != ''
GROUP BY region
ORDER BY avg_rate DESC
""").df()
print("Summary by Region:")
summary_dfSummary by Region:
| region | first_year | last_year | avg_rate | data_points | |
|---|---|---|---|---|---|
| 0 | Sub-Saharan Africa | 1990 | 2024 | 75.0 | 210 |
| 1 | East Asia & Pacific | 1990 | 2024 | 66.5 | 350 |
| 2 | North America | 1990 | 2024 | 65.0 | 70 |
| 3 | Latin America & Caribbean | 1990 | 2024 | 63.3 | 245 |
| 4 | Other | 1990 | 2024 | 61.5 | 6609 |
| 5 | Europe & Central Asia | 1990 | 2024 | 57.5 | 347 |
| 6 | South Asia | 1990 | 2024 | 52.1 | 175 |
| 7 | Middle East & North Africa | 1990 | 2024 | 46.3 | 210 |
This notebook created the following files in the data/ folder: - labor_force_data.csv - Raw indicator data - data_dictionary.csv - Metadata about countries - worldbank.duckdb - DuckDB database with cleaned tables - regional_labor_force_chart.html - Interactive chart
---
jupyter:
jupytext:
text_representation:
extension: .qmd
format_name: quarto
format_version: '1.0'
jupytext_version: 1.18.1
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---
# DEMO NOTEBOOK that will get included with the team submission.
# World Bank Labor Force Analysis
This notebook downloads World Bank labor force participation data and creates
a simple visualization showing trends by region over time.
**What you'll learn:**
- How to download data from the World Bank
- How to use DuckDB for fast CSV loading and SQL analysis
- How to create charts with Altair
**Required packages:**
```
pip install requests duckdb pandas altair
```
## Step 1: Setup and Imports
```{python}
import requests
import duckdb
import altair as alt
from pathlib import Path
```
```{python}
# Create a data folder to store our files
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)
print(f"Data will be saved to: {DATA_DIR.absolute()}")
```
## Step 2: Download the Data
We'll download two files from the World Bank:
1. **Indicator data** - The actual labor force participation rates
2. **Dictionary data** - Metadata about countries and regions
```{python}
# URLs for World Bank data
INDICATOR_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS.csv"
DICTIONARY_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS_DATADICT.csv"
# Local file paths
INDICATOR_FILE = DATA_DIR / "labor_force_data.csv"
DICTIONARY_FILE = DATA_DIR / "data_dictionary.csv"
```
```{python}
# Download indicator data (skip if already exists)
if not INDICATOR_FILE.exists():
print("Downloading indicator data...")
response = requests.get(INDICATOR_URL, timeout=60)
response.raise_for_status()
INDICATOR_FILE.write_bytes(response.content)
print(f"Saved to {INDICATOR_FILE}")
else:
print(f"Using cached file: {INDICATOR_FILE}")
```
```{python}
# Download dictionary data (skip if already exists)
if not DICTIONARY_FILE.exists():
print("Downloading dictionary data...")
response = requests.get(DICTIONARY_URL, timeout=60)
response.raise_for_status()
DICTIONARY_FILE.write_bytes(response.content)
print(f"Saved to {DICTIONARY_FILE}")
else:
print(f"Using cached file: {DICTIONARY_FILE}")
```
## Step 3: Load CSVs into DuckDB
DuckDB can read CSV files directly - no need to load into pandas first!
This is faster and more memory efficient for large files.
```{python}
# Create a DuckDB connection
DB_PATH = DATA_DIR / "worldbank.duckdb"
conn = duckdb.connect(str(DB_PATH))
print(f"Connected to DuckDB: {DB_PATH}")
```
```{python}
# Load indicator CSV directly into DuckDB
conn.execute(f"""
CREATE OR REPLACE TABLE indicator_raw AS
SELECT * FROM read_csv_auto('{INDICATOR_FILE}', header=True)
""")
row_count = conn.execute("SELECT COUNT(*) FROM indicator_raw").fetchone()[0]
print(f"Loaded indicator_raw: {row_count:,} rows")
```
```{python}
# Load dictionary CSV directly into DuckDB
conn.execute(f"""
CREATE OR REPLACE TABLE dictionary AS
SELECT DISTINCT * FROM read_csv_auto('{DICTIONARY_FILE}', header=True)
""")
row_count = conn.execute("SELECT COUNT(*) FROM dictionary").fetchone()[0]
print(f"Loaded dictionary: {row_count:,} rows")
```
## Step 4: Explore the Data
Let's take a quick look at what we loaded using DuckDB queries.
```{python}
# Preview indicator data (returns a pandas DataFrame)
conn.execute("SELECT * FROM indicator_raw LIMIT 5").df()
```
```{python}
# Check indicator columns
conn.execute("DESCRIBE indicator_raw").df()
```
```{python}
# Preview dictionary data
conn.execute("SELECT * FROM dictionary LIMIT 5").df()
```
```{python}
# Check dictionary columns
conn.execute("DESCRIBE dictionary").df()
```
## Step 5: Clean and Transform the Data
Create a clean, analysis-ready table with proper data types and region info.
```{python}
# Create a cleaned indicator table
conn.execute("""
CREATE OR REPLACE TABLE indicator_clean AS
SELECT
REF_AREA AS country_code,
REF_AREA_LABEL AS country_name,
CAST(TIME_PERIOD AS INTEGER) AS year,
CAST(OBS_VALUE AS DOUBLE) AS value,
INDICATOR_LABEL AS indicator_name
FROM indicator_raw
WHERE OBS_VALUE IS NOT NULL
AND TIME_PERIOD IS NOT NULL
""")
row_count = conn.execute("SELECT COUNT(*) FROM indicator_clean").fetchone()[0]
print(f"Cleaned data: {row_count:,} rows")
```
```{python}
# Preview the cleaned data
conn.execute("SELECT * FROM indicator_clean LIMIT 5").df()
```
```{python}
# Check what columns the dictionary table has
print("Dictionary table columns:")
print(conn.execute("DESCRIBE dictionary").df())
```
```{python}
# The World Bank data dictionary file describes variables, not country metadata.
# We'll create a simple region mapping based on country codes.
# For a real project, you'd download a proper country-region mapping file.
# Create a region mapping table based on common World Bank region codes
conn.execute("""
CREATE OR REPLACE TABLE region_mapping AS
SELECT * FROM (VALUES
-- East Asia & Pacific
('CHN', 'East Asia & Pacific'),
('JPN', 'East Asia & Pacific'),
('KOR', 'East Asia & Pacific'),
('AUS', 'East Asia & Pacific'),
('IDN', 'East Asia & Pacific'),
('THA', 'East Asia & Pacific'),
('VNM', 'East Asia & Pacific'),
('MYS', 'East Asia & Pacific'),
('PHL', 'East Asia & Pacific'),
('NZL', 'East Asia & Pacific'),
-- Europe & Central Asia
('DEU', 'Europe & Central Asia'),
('FRA', 'Europe & Central Asia'),
('GBR', 'Europe & Central Asia'),
('ITA', 'Europe & Central Asia'),
('ESP', 'Europe & Central Asia'),
('POL', 'Europe & Central Asia'),
('NLD', 'Europe & Central Asia'),
('TUR', 'Europe & Central Asia'),
('RUS', 'Europe & Central Asia'),
('UKR', 'Europe & Central Asia'),
-- Latin America & Caribbean
('BRA', 'Latin America & Caribbean'),
('MEX', 'Latin America & Caribbean'),
('ARG', 'Latin America & Caribbean'),
('COL', 'Latin America & Caribbean'),
('CHL', 'Latin America & Caribbean'),
('PER', 'Latin America & Caribbean'),
('VEN', 'Latin America & Caribbean'),
-- Middle East & North Africa
('EGY', 'Middle East & North Africa'),
('SAU', 'Middle East & North Africa'),
('IRN', 'Middle East & North Africa'),
('IRQ', 'Middle East & North Africa'),
('MAR', 'Middle East & North Africa'),
('DZA', 'Middle East & North Africa'),
-- North America
('USA', 'North America'),
('CAN', 'North America'),
-- South Asia
('IND', 'South Asia'),
('PAK', 'South Asia'),
('BGD', 'South Asia'),
('LKA', 'South Asia'),
('NPL', 'South Asia'),
-- Sub-Saharan Africa
('NGA', 'Sub-Saharan Africa'),
('ZAF', 'Sub-Saharan Africa'),
('KEN', 'Sub-Saharan Africa'),
('ETH', 'Sub-Saharan Africa'),
('GHA', 'Sub-Saharan Africa'),
('TZA', 'Sub-Saharan Africa')
) AS t(country_code, region)
""")
print("Created region_mapping table")
```
```{python}
# Join with region mapping to add region information
conn.execute("""
CREATE OR REPLACE TABLE indicator_with_region AS
SELECT
i.*,
COALESCE(r.region, 'Other') AS region
FROM indicator_clean i
LEFT JOIN region_mapping r ON UPPER(i.country_code) = r.country_code
""")
print("Created table: indicator_with_region")
```
```{python}
# Preview with regions
conn.execute("SELECT * FROM indicator_with_region LIMIT 10").df()
```
## Step 6: Aggregate by Region
Calculate average labor force participation rate by region and year.
```{python}
# Query regional averages and return as pandas DataFrame for charting
regional_df = conn.execute("""
SELECT
region,
year,
AVG(value) AS avg_participation_rate,
COUNT(DISTINCT country_code) AS num_countries
FROM indicator_with_region
WHERE region != 'Unknown'
AND region IS NOT NULL
AND region != ''
GROUP BY region, year
HAVING COUNT(DISTINCT country_code) >= 3
ORDER BY region, year
""").df()
print(f"Regional aggregates: {len(regional_df)} rows")
regional_df.head(10)
```
```{python}
# What regions do we have?
regional_df['region'].unique()
```
## Step 7: Create the Visualization
Create a line chart showing labor force participation trends by region.
```{python}
# Configure Altair to handle large datasets
alt.data_transformers.disable_max_rows()
```
```{python}
# Create the chart
chart = alt.Chart(regional_df).mark_line(point=True).encode(
x=alt.X('year:Q', title='Year'),
y=alt.Y('avg_participation_rate:Q', title='Average Participation Rate (%)'),
color=alt.Color('region:N', title='Region'),
tooltip=[
alt.Tooltip('region:N', title='Region'),
alt.Tooltip('year:Q', title='Year'),
alt.Tooltip('avg_participation_rate:Q', title='Avg Rate', format='.1f'),
alt.Tooltip('num_countries:Q', title='Countries')
]
).properties(
width=700,
height=400,
title='Labor Force Participation Rate by Region Over Time'
)
chart
```
```{python}
# Save the chart as an HTML file
CHART_FILE = DATA_DIR / "regional_labor_force_chart.html"
chart.save(str(CHART_FILE))
print(f"Chart saved to: {CHART_FILE}")
```
## Step 8: Summary Statistics
```{python}
# Get summary stats (DuckDB query → pandas DataFrame)
summary_df = conn.execute("""
SELECT
region,
MIN(year) AS first_year,
MAX(year) AS last_year,
ROUND(AVG(value), 1) AS avg_rate,
COUNT(*) AS data_points
FROM indicator_with_region
WHERE region != 'Unknown' AND region IS NOT NULL AND region != ''
GROUP BY region
ORDER BY avg_rate DESC
""").df()
print("Summary by Region:")
summary_df
```
```{python}
# Close the database connection
conn.close()
print("Done! Database connection closed.")
```
## Files Created
This notebook created the following files in the `data/` folder:
- `labor_force_data.csv` - Raw indicator data
- `data_dictionary.csv` - Metadata about countries
- `worldbank.duckdb` - DuckDB database with cleaned tables
- `regional_labor_force_chart.html` - Interactive chart