DEMO NOTEBOOK that will get included with the team submission.

World Bank Labor Force Analysis

This notebook downloads World Bank labor force participation data and creates a simple visualization showing trends by region over time.

What you’ll learn: - How to download data from the World Bank - How to use DuckDB for fast CSV loading and SQL analysis - How to create charts with Altair

Required packages:

pip install requests duckdb pandas altair

Step 1: Setup and Imports

Code

import requests
import duckdb
import altair as alt
from pathlib import Path

Code

# Create a data folder to store our files
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

print(f"Data will be saved to: {DATA_DIR.absolute()}")

Data will be saved to: /home/runner/work/DataDive25/DataDive25/Team_Projects/challenge_2/data

Step 2: Download the Data

We’ll download two files from the World Bank: 1. Indicator data - The actual labor force participation rates 2. Dictionary data - Metadata about countries and regions

Code

# URLs for World Bank data
INDICATOR_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS.csv"
DICTIONARY_URL = "https://data360files.worldbank.org/data360-data/data/WB_WDI/WB_WDI_SL_TLF_CACT_ZS_DATADICT.csv"

# Local file paths
INDICATOR_FILE = DATA_DIR / "labor_force_data.csv"
DICTIONARY_FILE = DATA_DIR / "data_dictionary.csv"

Code

# Download indicator data (skip if already exists)
if not INDICATOR_FILE.exists():
    print("Downloading indicator data...")
    response = requests.get(INDICATOR_URL, timeout=60)
    response.raise_for_status()
    INDICATOR_FILE.write_bytes(response.content)
    print(f"Saved to {INDICATOR_FILE}")
else:
    print(f"Using cached file: {INDICATOR_FILE}")

Downloading indicator data...
Saved to data/labor_force_data.csv

Code

# Download dictionary data (skip if already exists)
if not DICTIONARY_FILE.exists():
    print("Downloading dictionary data...")
    response = requests.get(DICTIONARY_URL, timeout=60)
    response.raise_for_status()
    DICTIONARY_FILE.write_bytes(response.content)
    print(f"Saved to {DICTIONARY_FILE}")
else:
    print(f"Using cached file: {DICTIONARY_FILE}")

Downloading dictionary data...
Saved to data/data_dictionary.csv

Step 3: Load CSVs into DuckDB

DuckDB can read CSV files directly - no need to load into pandas first! This is faster and more memory efficient for large files.

Code

# Create a DuckDB connection
DB_PATH = DATA_DIR / "worldbank.duckdb"
conn = duckdb.connect(str(DB_PATH))

print(f"Connected to DuckDB: {DB_PATH}")

Connected to DuckDB: data/worldbank.duckdb

Code

# Load indicator CSV directly into DuckDB
conn.execute(f"""
    CREATE OR REPLACE TABLE indicator_raw AS
    SELECT * FROM read_csv_auto('{INDICATOR_FILE}', header=True)
""")

row_count = conn.execute("SELECT COUNT(*) FROM indicator_raw").fetchone()[0]
print(f"Loaded indicator_raw: {row_count:,} rows")

Loaded indicator_raw: 8,216 rows

Code

# Load dictionary CSV directly into DuckDB
conn.execute(f"""
    CREATE OR REPLACE TABLE dictionary AS
    SELECT DISTINCT * FROM read_csv_auto('{DICTIONARY_FILE}', header=True)
""")

row_count = conn.execute("SELECT COUNT(*) FROM dictionary").fetchone()[0]
print(f"Loaded dictionary: {row_count:,} rows")

Loaded dictionary: 42 rows

Step 4: Explore the Data

Let’s take a quick look at what we loaded using DuckDB queries.

Code

# Preview indicator data (returns a pandas DataFrame)
conn.execute("SELECT * FROM indicator_raw LIMIT 5").df()

	STRUCTURE	STRUCTURE_ID	ACTION	FREQ	FREQ_LABEL	REF_AREA	REF_AREA_LABEL	INDICATOR	INDICATOR_LABEL	SEX	...	DATA_SOURCE_LABEL	UNIT_TYPE	UNIT_TYPE_LABEL	TIME_FORMAT	TIME_FORMAT_LABEL	COMMENT_OBS	OBS_STATUS	OBS_STATUS_LABEL	OBS_CONF	OBS_CONF_LABEL
0	datastructure	WB.DATA360:DS_DATA360(1.3)	I	A	Annual	AFE	Africa Eastern and Southern	WB_WDI_SL_TLF_CACT_ZS	Labor force participation rate, total (% of to...	_T	...	World Development Indicators (WDI)	RATIO	Ratio	P1Y	Annual	None	A	Normal value	PU	Public
1	datastructure	WB.DATA360:DS_DATA360(1.3)	I	A	Annual	AFW	Africa Western and Central	WB_WDI_SL_TLF_CACT_ZS	Labor force participation rate, total (% of to...	_T	...	World Development Indicators (WDI)	RATIO	Ratio	P1Y	Annual	None	A	Normal value	PU	Public
2	datastructure	WB.DATA360:DS_DATA360(1.3)	I	A	Annual	ARB	Arab World	WB_WDI_SL_TLF_CACT_ZS	Labor force participation rate, total (% of to...	_T	...	World Development Indicators (WDI)	RATIO	Ratio	P1Y	Annual	None	A	Normal value	PU	Public
3	datastructure	WB.DATA360:DS_DATA360(1.3)	I	A	Annual	CSS	Caribbean small states	WB_WDI_SL_TLF_CACT_ZS	Labor force participation rate, total (% of to...	_T	...	World Development Indicators (WDI)	RATIO	Ratio	P1Y	Annual	None	A	Normal value	PU	Public
4	datastructure	WB.DATA360:DS_DATA360(1.3)	I	A	Annual	CEB	Central Electricity Board (CEB)	WB_WDI_SL_TLF_CACT_ZS	Labor force participation rate, total (% of to...	_T	...	World Development Indicators (WDI)	RATIO	Ratio	P1Y	Annual	None	A	Normal value	PU	Public

5 rows × 45 columns

Code

# Check indicator columns
conn.execute("DESCRIBE indicator_raw").df()

	column_name	column_type	null	key	default	extra
0	STRUCTURE	VARCHAR	YES	None	None	None
1	STRUCTURE_ID	VARCHAR	YES	None	None	None
2	ACTION	VARCHAR	YES	None	None	None
3	FREQ	VARCHAR	YES	None	None	None
4	FREQ_LABEL	VARCHAR	YES	None	None	None
5	REF_AREA	VARCHAR	YES	None	None	None
6	REF_AREA_LABEL	VARCHAR	YES	None	None	None
7	INDICATOR	VARCHAR	YES	None	None	None
8	INDICATOR_LABEL	VARCHAR	YES	None	None	None
9	SEX	VARCHAR	YES	None	None	None
10	SEX_LABEL	VARCHAR	YES	None	None	None
11	AGE	VARCHAR	YES	None	None	None
12	AGE_LABEL	VARCHAR	YES	None	None	None
13	URBANISATION	VARCHAR	YES	None	None	None
14	URBANISATION_LABEL	VARCHAR	YES	None	None	None
15	UNIT_MEASURE	VARCHAR	YES	None	None	None
16	UNIT_MEASURE_LABEL	VARCHAR	YES	None	None	None
17	COMP_BREAKDOWN_1	VARCHAR	YES	None	None	None
18	COMP_BREAKDOWN_1_LABEL	VARCHAR	YES	None	None	None
19	COMP_BREAKDOWN_2	VARCHAR	YES	None	None	None
20	COMP_BREAKDOWN_2_LABEL	VARCHAR	YES	None	None	None
21	COMP_BREAKDOWN_3	VARCHAR	YES	None	None	None
22	COMP_BREAKDOWN_3_LABEL	VARCHAR	YES	None	None	None
23	TIME_PERIOD	BIGINT	YES	None	None	None
24	OBS_VALUE	DOUBLE	YES	None	None	None
25	AGG_METHOD	VARCHAR	YES	None	None	None
26	AGG_METHOD_LABEL	VARCHAR	YES	None	None	None
27	DECIMALS	BIGINT	YES	None	None	None
28	DECIMALS_LABEL	VARCHAR	YES	None	None	None
29	DATABASE_ID	VARCHAR	YES	None	None	None
30	DATABASE_ID_LABEL	VARCHAR	YES	None	None	None
31	COMMENT_TS	VARCHAR	YES	None	None	None
32	UNIT_MULT	BIGINT	YES	None	None	None
33	UNIT_MULT_LABEL	VARCHAR	YES	None	None	None
34	DATA_SOURCE	VARCHAR	YES	None	None	None
35	DATA_SOURCE_LABEL	VARCHAR	YES	None	None	None
36	UNIT_TYPE	VARCHAR	YES	None	None	None
37	UNIT_TYPE_LABEL	VARCHAR	YES	None	None	None
38	TIME_FORMAT	VARCHAR	YES	None	None	None
39	TIME_FORMAT_LABEL	VARCHAR	YES	None	None	None
40	COMMENT_OBS	VARCHAR	YES	None	None	None
41	OBS_STATUS	VARCHAR	YES	None	None	None
42	OBS_STATUS_LABEL	VARCHAR	YES	None	None	None
43	OBS_CONF	VARCHAR	YES	None	None	None
44	OBS_CONF_LABEL	VARCHAR	YES	None	None	None

Code

# Preview dictionary data
conn.execute("SELECT * FROM dictionary LIMIT 5").df()

	VARIABLE_ID	VARIABLE_LABEL	VARIABLE_DESCRIPTION	VARIABLE_DATA_TYPE	VARIABLE_REQUIRED
0	FREQ_LABEL	Frequency of observation label	Time interval at which observations occur over...	String	True
1	URBANISATION_LABEL	Degree of urbanisation label	Refers to Total, Urban, or Rural location.	String	True
2	COMP_BREAKDOWN_1_LABEL	Custom Dimension 1 label	None	String	True
3	COMP_BREAKDOWN_3_LABEL	Custom Dimension 3 label	None	String	True
4	AGG_METHOD_LABEL	Aggregation method label	None	String	False

Code

# Check dictionary columns  
conn.execute("DESCRIBE dictionary").df()

	column_name	column_type	null	key	default	extra
0	VARIABLE_ID	VARCHAR	YES	None	None	None
1	VARIABLE_LABEL	VARCHAR	YES	None	None	None
2	VARIABLE_DESCRIPTION	VARCHAR	YES	None	None	None
3	VARIABLE_DATA_TYPE	VARCHAR	YES	None	None	None
4	VARIABLE_REQUIRED	BOOLEAN	YES	None	None	None

Step 5: Clean and Transform the Data

Create a clean, analysis-ready table with proper data types and region info.

Code

# Create a cleaned indicator table
conn.execute("""
    CREATE OR REPLACE TABLE indicator_clean AS
    SELECT
        REF_AREA AS country_code,
        REF_AREA_LABEL AS country_name,
        CAST(TIME_PERIOD AS INTEGER) AS year,
        CAST(OBS_VALUE AS DOUBLE) AS value,
        INDICATOR_LABEL AS indicator_name
    FROM indicator_raw
    WHERE OBS_VALUE IS NOT NULL
      AND TIME_PERIOD IS NOT NULL
""")

row_count = conn.execute("SELECT COUNT(*) FROM indicator_clean").fetchone()[0]
print(f"Cleaned data: {row_count:,} rows")

Cleaned data: 8,216 rows

Code

# Preview the cleaned data
conn.execute("SELECT * FROM indicator_clean LIMIT 5").df()

	country_code	country_name	year	value	indicator_name
0	AFE	Africa Eastern and Southern	1990	72.111549	Labor force participation rate, total (% of to...
1	AFW	Africa Western and Central	1990	76.593802	Labor force participation rate, total (% of to...
2	ARB	Arab World	1990	47.597590	Labor force participation rate, total (% of to...
3	CSS	Caribbean small states	1990	59.731386	Labor force participation rate, total (% of to...
4	CEB	Central Electricity Board (CEB)	1990	57.682804	Labor force participation rate, total (% of to...

Code

# Check what columns the dictionary table has
print("Dictionary table columns:")
print(conn.execute("DESCRIBE dictionary").df())

Dictionary table columns:
            column_name column_type null   key default extra
0           VARIABLE_ID     VARCHAR  YES  None    None  None
1        VARIABLE_LABEL     VARCHAR  YES  None    None  None
2  VARIABLE_DESCRIPTION     VARCHAR  YES  None    None  None
3    VARIABLE_DATA_TYPE     VARCHAR  YES  None    None  None
4     VARIABLE_REQUIRED     BOOLEAN  YES  None    None  None

Code

# The World Bank data dictionary file describes variables, not country metadata.
# We'll create a simple region mapping based on country codes.
# For a real project, you'd download a proper country-region mapping file.

# Create a region mapping table based on common World Bank region codes
conn.execute("""
    CREATE OR REPLACE TABLE region_mapping AS
    SELECT * FROM (VALUES
        -- East Asia & Pacific
        ('CHN', 'East Asia & Pacific'),
        ('JPN', 'East Asia & Pacific'),
        ('KOR', 'East Asia & Pacific'),
        ('AUS', 'East Asia & Pacific'),
        ('IDN', 'East Asia & Pacific'),
        ('THA', 'East Asia & Pacific'),
        ('VNM', 'East Asia & Pacific'),
        ('MYS', 'East Asia & Pacific'),
        ('PHL', 'East Asia & Pacific'),
        ('NZL', 'East Asia & Pacific'),
        -- Europe & Central Asia
        ('DEU', 'Europe & Central Asia'),
        ('FRA', 'Europe & Central Asia'),
        ('GBR', 'Europe & Central Asia'),
        ('ITA', 'Europe & Central Asia'),
        ('ESP', 'Europe & Central Asia'),
        ('POL', 'Europe & Central Asia'),
        ('NLD', 'Europe & Central Asia'),
        ('TUR', 'Europe & Central Asia'),
        ('RUS', 'Europe & Central Asia'),
        ('UKR', 'Europe & Central Asia'),
        -- Latin America & Caribbean
        ('BRA', 'Latin America & Caribbean'),
        ('MEX', 'Latin America & Caribbean'),
        ('ARG', 'Latin America & Caribbean'),
        ('COL', 'Latin America & Caribbean'),
        ('CHL', 'Latin America & Caribbean'),
        ('PER', 'Latin America & Caribbean'),
        ('VEN', 'Latin America & Caribbean'),
        -- Middle East & North Africa
        ('EGY', 'Middle East & North Africa'),
        ('SAU', 'Middle East & North Africa'),
        ('IRN', 'Middle East & North Africa'),
        ('IRQ', 'Middle East & North Africa'),
        ('MAR', 'Middle East & North Africa'),
        ('DZA', 'Middle East & North Africa'),
        -- North America
        ('USA', 'North America'),
        ('CAN', 'North America'),
        -- South Asia
        ('IND', 'South Asia'),
        ('PAK', 'South Asia'),
        ('BGD', 'South Asia'),
        ('LKA', 'South Asia'),
        ('NPL', 'South Asia'),
        -- Sub-Saharan Africa
        ('NGA', 'Sub-Saharan Africa'),
        ('ZAF', 'Sub-Saharan Africa'),
        ('KEN', 'Sub-Saharan Africa'),
        ('ETH', 'Sub-Saharan Africa'),
        ('GHA', 'Sub-Saharan Africa'),
        ('TZA', 'Sub-Saharan Africa')
    ) AS t(country_code, region)
""")

print("Created region_mapping table")

Created region_mapping table

Code

# Join with region mapping to add region information
conn.execute("""
    CREATE OR REPLACE TABLE indicator_with_region AS
    SELECT
        i.*,
        COALESCE(r.region, 'Other') AS region
    FROM indicator_clean i
    LEFT JOIN region_mapping r ON UPPER(i.country_code) = r.country_code
""")

print("Created table: indicator_with_region")

Created table: indicator_with_region

Code

# Preview with regions
conn.execute("SELECT * FROM indicator_with_region LIMIT 10").df()

	country_code	country_name	year	value	indicator_name	region
0	DZA	Algeria	1990	45.587	Labor force participation rate, total (% of to...	Middle East & North Africa
1	ARG	Argentina	1990	59.033	Labor force participation rate, total (% of to...	Latin America & Caribbean
2	AUS	Australia	1990	63.894	Labor force participation rate, total (% of to...	East Asia & Pacific
3	BGD	Bangladesh	1990	56.638	Labor force participation rate, total (% of to...	South Asia
4	BRA	Brazil	1990	64.682	Labor force participation rate, total (% of to...	Latin America & Caribbean
5	CAN	Canada	1990	66.760	Labor force participation rate, total (% of to...	North America
6	CHL	Chile	1990	52.415	Labor force participation rate, total (% of to...	Latin America & Caribbean
7	CHN	China	1990	79.170	Labor force participation rate, total (% of to...	East Asia & Pacific
8	COL	Colombia	1990	66.705	Labor force participation rate, total (% of to...	Latin America & Caribbean
9	EGY	Egypt, Arab Rep.	1990	46.619	Labor force participation rate, total (% of to...	Middle East & North Africa

Step 6: Aggregate by Region

Calculate average labor force participation rate by region and year.

Code

# Query regional averages and return as pandas DataFrame for charting
regional_df = conn.execute("""
    SELECT
        region,
        year,
        AVG(value) AS avg_participation_rate,
        COUNT(DISTINCT country_code) AS num_countries
    FROM indicator_with_region
    WHERE region != 'Unknown'
      AND region IS NOT NULL
      AND region != ''
    GROUP BY region, year
    HAVING COUNT(DISTINCT country_code) >= 3
    ORDER BY region, year
""").df()

print(f"Regional aggregates: {len(regional_df)} rows")
regional_df.head(10)

Regional aggregates: 245 rows

	region	year	avg_participation_rate	num_countries
0	East Asia & Pacific	1990	67.3539	10
1	East Asia & Pacific	1991	67.2899	10
2	East Asia & Pacific	1992	67.2213	10
3	East Asia & Pacific	1993	67.0079	10
4	East Asia & Pacific	1994	67.2205	10
5	East Asia & Pacific	1995	67.3145	10
6	East Asia & Pacific	1996	67.3737	10
7	East Asia & Pacific	1997	67.1655	10
8	East Asia & Pacific	1998	66.7576	10
9	East Asia & Pacific	1999	66.4673	10

Code

# What regions do we have?
regional_df['region'].unique()

array(['East Asia & Pacific', 'Europe & Central Asia',
       'Latin America & Caribbean', 'Middle East & North Africa', 'Other',
       'South Asia', 'Sub-Saharan Africa'], dtype=object)

Step 7: Create the Visualization

Create a line chart showing labor force participation trends by region.

Code

# Configure Altair to handle large datasets
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

Code

# Create the chart
chart = alt.Chart(regional_df).mark_line(point=True).encode(
    x=alt.X('year:Q', title='Year'),
    y=alt.Y('avg_participation_rate:Q', title='Average Participation Rate (%)'),
    color=alt.Color('region:N', title='Region'),
    tooltip=[
        alt.Tooltip('region:N', title='Region'),
        alt.Tooltip('year:Q', title='Year'),
        alt.Tooltip('avg_participation_rate:Q', title='Avg Rate', format='.1f'),
        alt.Tooltip('num_countries:Q', title='Countries')
    ]
).properties(
    width=700,
    height=400,
    title='Labor Force Participation Rate by Region Over Time'
)

chart

Code

# Save the chart as an HTML file
CHART_FILE = DATA_DIR / "regional_labor_force_chart.html"
chart.save(str(CHART_FILE))
print(f"Chart saved to: {CHART_FILE}")

Chart saved to: data/regional_labor_force_chart.html

Step 8: Summary Statistics

Code

# Get summary stats (DuckDB query → pandas DataFrame)
summary_df = conn.execute("""
    SELECT
        region,
        MIN(year) AS first_year,
        MAX(year) AS last_year,
        ROUND(AVG(value), 1) AS avg_rate,
        COUNT(*) AS data_points
    FROM indicator_with_region
    WHERE region != 'Unknown' AND region IS NOT NULL AND region != ''
    GROUP BY region
    ORDER BY avg_rate DESC
""").df()

print("Summary by Region:")
summary_df

Summary by Region:

	region	first_year	last_year	avg_rate	data_points
0	Sub-Saharan Africa	1990	2024	75.0	210
1	East Asia & Pacific	1990	2024	66.5	350
2	North America	1990	2024	65.0	70
3	Latin America & Caribbean	1990	2024	63.3	245
4	Other	1990	2024	61.5	6609
5	Europe & Central Asia	1990	2024	57.5	347
6	South Asia	1990	2024	52.1	175
7	Middle East & North Africa	1990	2024	46.3	210

Code

# Close the database connection
conn.close()
print("Done! Database connection closed.")

Done! Database connection closed.

Files Created

This notebook created the following files in the data/ folder: - labor_force_data.csv - Raw indicator data - data_dictionary.csv - Metadata about countries - worldbank.duckdb - DuckDB database with cleaned tables - regional_labor_force_chart.html - Interactive chart