DEMO NOTEBOOK that will get included with the team submission.

World Bank Labor Force Analysis

This notebook downloads World Bank labor force participation data and creates a simple visualization showing trends by region over time.

What you’ll learn: - How to download data from the World Bank - How to use DuckDB for fast CSV loading and SQL analysis - How to create charts with Altair

Required packages:

pip install requests duckdb pandas altair

Step 1: Setup and Imports

Code
import requests
import duckdb
import altair as alt
from pathlib import Path
Code
# Create a data folder to store our files
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print(f"Data will be saved to: {DATA_DIR.absolute()}")
Data will be saved to: /home/runner/work/DataDive25/DataDive25/Team_Projects/GenerationNext/data

Step 2: Download the Data

We’ll download two files from the World Bank: 1. Indicator data - The actual labor force participation rates 2. Dictionary data - Metadata about countries and regions

Code
# URLs for World Bank data
INDICATOR_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS.csv"
DICTIONARY_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS_DATADICT.csv"

# Local file paths
INDICATOR_FILE = DATA_DIR / "labor_force_data.csv"
DICTIONARY_FILE = DATA_DIR / "data_dictionary.csv"
Code
# Download indicator data (skip if already exists)
if not INDICATOR_FILE.exists():
    print("Downloading indicator data...")
    response = requests.get(INDICATOR_URL, timeout=60)
    response.raise_for_status()
    INDICATOR_FILE.write_bytes(response.content)
    print(f"Saved to {INDICATOR_FILE}")
else:
    print(f"Using cached file: {INDICATOR_FILE}")
Downloading indicator data...
Saved to data/labor_force_data.csv
Code
# Download dictionary data (skip if already exists)
if not DICTIONARY_FILE.exists():
    print("Downloading dictionary data...")
    response = requests.get(DICTIONARY_URL, timeout=60)
    response.raise_for_status()
    DICTIONARY_FILE.write_bytes(response.content)
    print(f"Saved to {DICTIONARY_FILE}")
else:
    print(f"Using cached file: {DICTIONARY_FILE}")
Downloading dictionary data...
Saved to data/data_dictionary.csv

Step 3: Load CSVs into DuckDB

DuckDB can read CSV files directly - no need to load into pandas first! This is faster and more memory efficient for large files.

Code
# Create a DuckDB connection
DB_PATH = DATA_DIR / "worldbank.duckdb"
conn = duckdb.connect(str(DB_PATH))

print(f"Connected to DuckDB: {DB_PATH}")
Connected to DuckDB: data/worldbank.duckdb
Code
# Load indicator CSV directly into DuckDB
conn.execute(f"""
    CREATE OR REPLACE TABLE indicator_raw AS
    SELECT * FROM read_csv_auto('{INDICATOR_FILE}', header=True)
""")

row_count = conn.execute("SELECT COUNT(*) FROM indicator_raw").fetchone()[0]
print(f"Loaded indicator_raw: {row_count:,} rows")
Loaded indicator_raw: 8,216 rows
Code
# Load dictionary CSV directly into DuckDB
conn.execute(f"""
    CREATE OR REPLACE TABLE dictionary AS
    SELECT DISTINCT * FROM read_csv_auto('{DICTIONARY_FILE}', header=True)
""")

row_count = conn.execute("SELECT COUNT(*) FROM dictionary").fetchone()[0]
print(f"Loaded dictionary: {row_count:,} rows")
Loaded dictionary: 42 rows

Step 4: Explore the Data

Let’s take a quick look at what we loaded using DuckDB queries.

Code
# Preview indicator data (returns a pandas DataFrame)
conn.execute("SELECT * FROM indicator_raw LIMIT 5").df()
STRUCTURE STRUCTURE_ID ACTION FREQ FREQ_LABEL REF_AREA REF_AREA_LABEL INDICATOR INDICATOR_LABEL SEX ... DATA_SOURCE_LABEL UNIT_TYPE UNIT_TYPE_LABEL TIME_FORMAT TIME_FORMAT_LABEL COMMENT_OBS OBS_STATUS OBS_STATUS_LABEL OBS_CONF OBS_CONF_LABEL
0 datastructure WB.DATA360:DS_DATA360(1.3) I A Annual AFE Africa Eastern and Southern WB_WDI_SL_TLF_CACT_ZS Labor force participation rate, total (% of to... _T ... World Development Indicators (WDI) RATIO Ratio P1Y Annual None A Normal value PU Public
1 datastructure WB.DATA360:DS_DATA360(1.3) I A Annual AFW Africa Western and Central WB_WDI_SL_TLF_CACT_ZS Labor force participation rate, total (% of to... _T ... World Development Indicators (WDI) RATIO Ratio P1Y Annual None A Normal value PU Public
2 datastructure WB.DATA360:DS_DATA360(1.3) I A Annual ARB Arab World WB_WDI_SL_TLF_CACT_ZS Labor force participation rate, total (% of to... _T ... World Development Indicators (WDI) RATIO Ratio P1Y Annual None A Normal value PU Public
3 datastructure WB.DATA360:DS_DATA360(1.3) I A Annual CSS Caribbean small states WB_WDI_SL_TLF_CACT_ZS Labor force participation rate, total (% of to... _T ... World Development Indicators (WDI) RATIO Ratio P1Y Annual None A Normal value PU Public
4 datastructure WB.DATA360:DS_DATA360(1.3) I A Annual CEB Central Electricity Board (CEB) WB_WDI_SL_TLF_CACT_ZS Labor force participation rate, total (% of to... _T ... World Development Indicators (WDI) RATIO Ratio P1Y Annual None A Normal value PU Public

5 rows × 45 columns

Code
# Check indicator columns
conn.execute("DESCRIBE indicator_raw").df()
column_name column_type null key default extra
0 STRUCTURE VARCHAR YES None None None
1 STRUCTURE_ID VARCHAR YES None None None
2 ACTION VARCHAR YES None None None
3 FREQ VARCHAR YES None None None
4 FREQ_LABEL VARCHAR YES None None None
5 REF_AREA VARCHAR YES None None None
6 REF_AREA_LABEL VARCHAR YES None None None
7 INDICATOR VARCHAR YES None None None
8 INDICATOR_LABEL VARCHAR YES None None None
9 SEX VARCHAR YES None None None
10 SEX_LABEL VARCHAR YES None None None
11 AGE VARCHAR YES None None None
12 AGE_LABEL VARCHAR YES None None None
13 URBANISATION VARCHAR YES None None None
14 URBANISATION_LABEL VARCHAR YES None None None
15 UNIT_MEASURE VARCHAR YES None None None
16 UNIT_MEASURE_LABEL VARCHAR YES None None None
17 COMP_BREAKDOWN_1 VARCHAR YES None None None
18 COMP_BREAKDOWN_1_LABEL VARCHAR YES None None None
19 COMP_BREAKDOWN_2 VARCHAR YES None None None
20 COMP_BREAKDOWN_2_LABEL VARCHAR YES None None None
21 COMP_BREAKDOWN_3 VARCHAR YES None None None
22 COMP_BREAKDOWN_3_LABEL VARCHAR YES None None None
23 TIME_PERIOD BIGINT YES None None None
24 OBS_VALUE DOUBLE YES None None None
25 AGG_METHOD VARCHAR YES None None None
26 AGG_METHOD_LABEL VARCHAR YES None None None
27 DECIMALS BIGINT YES None None None
28 DECIMALS_LABEL VARCHAR YES None None None
29 DATABASE_ID VARCHAR YES None None None
30 DATABASE_ID_LABEL VARCHAR YES None None None
31 COMMENT_TS VARCHAR YES None None None
32 UNIT_MULT BIGINT YES None None None
33 UNIT_MULT_LABEL VARCHAR YES None None None
34 DATA_SOURCE VARCHAR YES None None None
35 DATA_SOURCE_LABEL VARCHAR YES None None None
36 UNIT_TYPE VARCHAR YES None None None
37 UNIT_TYPE_LABEL VARCHAR YES None None None
38 TIME_FORMAT VARCHAR YES None None None
39 TIME_FORMAT_LABEL VARCHAR YES None None None
40 COMMENT_OBS VARCHAR YES None None None
41 OBS_STATUS VARCHAR YES None None None
42 OBS_STATUS_LABEL VARCHAR YES None None None
43 OBS_CONF VARCHAR YES None None None
44 OBS_CONF_LABEL VARCHAR YES None None None
Code
# Preview dictionary data
conn.execute("SELECT * FROM dictionary LIMIT 5").df()
VARIABLE_ID VARIABLE_LABEL VARIABLE_DESCRIPTION VARIABLE_DATA_TYPE VARIABLE_REQUIRED
0 FREQ_LABEL Frequency of observation label Time interval at which observations occur over... String True
1 URBANISATION_LABEL Degree of urbanisation label Refers to Total, Urban, or Rural location. String True
2 COMP_BREAKDOWN_1_LABEL Custom Dimension 1 label None String True
3 COMP_BREAKDOWN_3_LABEL Custom Dimension 3 label None String True
4 AGG_METHOD_LABEL Aggregation method label None String False
Code
# Check dictionary columns  
conn.execute("DESCRIBE dictionary").df()
column_name column_type null key default extra
0 VARIABLE_ID VARCHAR YES None None None
1 VARIABLE_LABEL VARCHAR YES None None None
2 VARIABLE_DESCRIPTION VARCHAR YES None None None
3 VARIABLE_DATA_TYPE VARCHAR YES None None None
4 VARIABLE_REQUIRED BOOLEAN YES None None None

Step 5: Clean and Transform the Data

Create a clean, analysis-ready table with proper data types and region info.

Code
# Create a cleaned indicator table
conn.execute("""
    CREATE OR REPLACE TABLE indicator_clean AS
    SELECT
        REF_AREA AS country_code,
        REF_AREA_LABEL AS country_name,
        CAST(TIME_PERIOD AS INTEGER) AS year,
        CAST(OBS_VALUE AS DOUBLE) AS value,
        INDICATOR_LABEL AS indicator_name
    FROM indicator_raw
    WHERE OBS_VALUE IS NOT NULL
      AND TIME_PERIOD IS NOT NULL
""")

row_count = conn.execute("SELECT COUNT(*) FROM indicator_clean").fetchone()[0]
print(f"Cleaned data: {row_count:,} rows")
Cleaned data: 8,216 rows
Code
# Preview the cleaned data
conn.execute("SELECT * FROM indicator_clean LIMIT 5").df()
country_code country_name year value indicator_name
0 AFE Africa Eastern and Southern 1990 72.111549 Labor force participation rate, total (% of to...
1 AFW Africa Western and Central 1990 76.593802 Labor force participation rate, total (% of to...
2 ARB Arab World 1990 47.597590 Labor force participation rate, total (% of to...
3 CSS Caribbean small states 1990 59.731386 Labor force participation rate, total (% of to...
4 CEB Central Electricity Board (CEB) 1990 57.682804 Labor force participation rate, total (% of to...
Code
# Check what columns the dictionary table has
print("Dictionary table columns:")
print(conn.execute("DESCRIBE dictionary").df())
Dictionary table columns:
            column_name column_type null   key default extra
0           VARIABLE_ID     VARCHAR  YES  None    None  None
1        VARIABLE_LABEL     VARCHAR  YES  None    None  None
2  VARIABLE_DESCRIPTION     VARCHAR  YES  None    None  None
3    VARIABLE_DATA_TYPE     VARCHAR  YES  None    None  None
4     VARIABLE_REQUIRED     BOOLEAN  YES  None    None  None
Code
# The World Bank data dictionary file describes variables, not country metadata.
# We'll create a simple region mapping based on country codes.
# For a real project, you'd download a proper country-region mapping file.

# Create a region mapping table based on common World Bank region codes
conn.execute("""
    CREATE OR REPLACE TABLE region_mapping AS
    SELECT * FROM (VALUES
        -- East Asia & Pacific
        ('CHN', 'East Asia & Pacific'),
        ('JPN', 'East Asia & Pacific'),
        ('KOR', 'East Asia & Pacific'),
        ('AUS', 'East Asia & Pacific'),
        ('IDN', 'East Asia & Pacific'),
        ('THA', 'East Asia & Pacific'),
        ('VNM', 'East Asia & Pacific'),
        ('MYS', 'East Asia & Pacific'),
        ('PHL', 'East Asia & Pacific'),
        ('NZL', 'East Asia & Pacific'),
        -- Europe & Central Asia
        ('DEU', 'Europe & Central Asia'),
        ('FRA', 'Europe & Central Asia'),
        ('GBR', 'Europe & Central Asia'),
        ('ITA', 'Europe & Central Asia'),
        ('ESP', 'Europe & Central Asia'),
        ('POL', 'Europe & Central Asia'),
        ('NLD', 'Europe & Central Asia'),
        ('TUR', 'Europe & Central Asia'),
        ('RUS', 'Europe & Central Asia'),
        ('UKR', 'Europe & Central Asia'),
        -- Latin America & Caribbean
        ('BRA', 'Latin America & Caribbean'),
        ('MEX', 'Latin America & Caribbean'),
        ('ARG', 'Latin America & Caribbean'),
        ('COL', 'Latin America & Caribbean'),
        ('CHL', 'Latin America & Caribbean'),
        ('PER', 'Latin America & Caribbean'),
        ('VEN', 'Latin America & Caribbean'),
        -- Middle East & North Africa
        ('EGY', 'Middle East & North Africa'),
        ('SAU', 'Middle East & North Africa'),
        ('IRN', 'Middle East & North Africa'),
        ('IRQ', 'Middle East & North Africa'),
        ('MAR', 'Middle East & North Africa'),
        ('DZA', 'Middle East & North Africa'),
        -- North America
        ('USA', 'North America'),
        ('CAN', 'North America'),
        -- South Asia
        ('IND', 'South Asia'),
        ('PAK', 'South Asia'),
        ('BGD', 'South Asia'),
        ('LKA', 'South Asia'),
        ('NPL', 'South Asia'),
        -- Sub-Saharan Africa
        ('NGA', 'Sub-Saharan Africa'),
        ('ZAF', 'Sub-Saharan Africa'),
        ('KEN', 'Sub-Saharan Africa'),
        ('ETH', 'Sub-Saharan Africa'),
        ('GHA', 'Sub-Saharan Africa'),
        ('TZA', 'Sub-Saharan Africa')
    ) AS t(country_code, region)
""")

print("Created region_mapping table")
Created region_mapping table
Code
# Join with region mapping to add region information
conn.execute("""
    CREATE OR REPLACE TABLE indicator_with_region AS
    SELECT
        i.*,
        COALESCE(r.region, 'Other') AS region
    FROM indicator_clean i
    LEFT JOIN region_mapping r ON UPPER(i.country_code) = r.country_code
""")

print("Created table: indicator_with_region")
Created table: indicator_with_region
Code
# Preview with regions
conn.execute("SELECT * FROM indicator_with_region LIMIT 10").df()
country_code country_name year value indicator_name region
0 DZA Algeria 1990 45.587 Labor force participation rate, total (% of to... Middle East & North Africa
1 ARG Argentina 1990 59.033 Labor force participation rate, total (% of to... Latin America & Caribbean
2 AUS Australia 1990 63.894 Labor force participation rate, total (% of to... East Asia & Pacific
3 BGD Bangladesh 1990 56.638 Labor force participation rate, total (% of to... South Asia
4 BRA Brazil 1990 64.682 Labor force participation rate, total (% of to... Latin America & Caribbean
5 CAN Canada 1990 66.760 Labor force participation rate, total (% of to... North America
6 CHL Chile 1990 52.415 Labor force participation rate, total (% of to... Latin America & Caribbean
7 CHN China 1990 79.170 Labor force participation rate, total (% of to... East Asia & Pacific
8 COL Colombia 1990 66.705 Labor force participation rate, total (% of to... Latin America & Caribbean
9 EGY Egypt, Arab Rep. 1990 46.619 Labor force participation rate, total (% of to... Middle East & North Africa

Step 6: Aggregate by Region

Calculate average labor force participation rate by region and year.

Code
# Query regional averages and return as pandas DataFrame for charting
regional_df = conn.execute("""
    SELECT
        region,
        year,
        AVG(value) AS avg_participation_rate,
        COUNT(DISTINCT country_code) AS num_countries
    FROM indicator_with_region
    WHERE region != 'Unknown'
      AND region IS NOT NULL
      AND region != ''
    GROUP BY region, year
    HAVING COUNT(DISTINCT country_code) >= 3
    ORDER BY region, year
""").df()

print(f"Regional aggregates: {len(regional_df)} rows")
regional_df.head(10)
Regional aggregates: 245 rows
region year avg_participation_rate num_countries
0 East Asia & Pacific 1990 67.3539 10
1 East Asia & Pacific 1991 67.2899 10
2 East Asia & Pacific 1992 67.2213 10
3 East Asia & Pacific 1993 67.0079 10
4 East Asia & Pacific 1994 67.2205 10
5 East Asia & Pacific 1995 67.3145 10
6 East Asia & Pacific 1996 67.3737 10
7 East Asia & Pacific 1997 67.1655 10
8 East Asia & Pacific 1998 66.7576 10
9 East Asia & Pacific 1999 66.4673 10
Code
# What regions do we have?
regional_df['region'].unique()
array(['East Asia & Pacific', 'Europe & Central Asia',
       'Latin America & Caribbean', 'Middle East & North Africa', 'Other',
       'South Asia', 'Sub-Saharan Africa'], dtype=object)

Step 7: Create the Visualization

Create a line chart showing labor force participation trends by region.

Code
# Configure Altair to handle large datasets
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')
Code
# Create the chart
chart = alt.Chart(regional_df).mark_line(point=True).encode(
    x=alt.X('year:Q', title='Year'),
    y=alt.Y('avg_participation_rate:Q', title='Average Participation Rate (%)'),
    color=alt.Color('region:N', title='Region'),
    tooltip=[
        alt.Tooltip('region:N', title='Region'),
        alt.Tooltip('year:Q', title='Year'),
        alt.Tooltip('avg_participation_rate:Q', title='Avg Rate', format='.1f'),
        alt.Tooltip('num_countries:Q', title='Countries')
    ]
).properties(
    width=700,
    height=400,
    title='Labor Force Participation Rate by Region Over Time'
)

chart
Code
# Save the chart as an HTML file
CHART_FILE = DATA_DIR / "regional_labor_force_chart.html"
chart.save(str(CHART_FILE))
print(f"Chart saved to: {CHART_FILE}")
Chart saved to: data/regional_labor_force_chart.html

Step 8: Summary Statistics

Code
# Get summary stats (DuckDB query → pandas DataFrame)
summary_df = conn.execute("""
    SELECT
        region,
        MIN(year) AS first_year,
        MAX(year) AS last_year,
        ROUND(AVG(value), 1) AS avg_rate,
        COUNT(*) AS data_points
    FROM indicator_with_region
    WHERE region != 'Unknown' AND region IS NOT NULL AND region != ''
    GROUP BY region
    ORDER BY avg_rate DESC
""").df()

print("Summary by Region:")
summary_df
Summary by Region:
region first_year last_year avg_rate data_points
0 Sub-Saharan Africa 1990 2024 75.0 210
1 East Asia & Pacific 1990 2024 66.5 350
2 North America 1990 2024 65.0 70
3 Latin America & Caribbean 1990 2024 63.3 245
4 Other 1990 2024 61.5 6609
5 Europe & Central Asia 1990 2024 57.5 347
6 South Asia 1990 2024 52.1 175
7 Middle East & North Africa 1990 2024 46.3 210
Code
# Close the database connection
conn.close()
print("Done! Database connection closed.")
Done! Database connection closed.

Files Created

This notebook created the following files in the data/ folder: - labor_force_data.csv - Raw indicator data - data_dictionary.csv - Metadata about countries - worldbank.duckdb - DuckDB database with cleaned tables - regional_labor_force_chart.html - Interactive chart