
Digital Marketing,
Marketing Analytics & AI+

A hands-on, 10-week applied course for non-programmers. Every concept starts with a marketing analogy, then becomes runnable Python code. No prior coding experience required — just business curiosity and a Google account.

  • Weeks: 10 modules
  • Tools: Colab · HF · Gradio · GitHub
  • AI Co-Pilot: Student & Manager modes
  • Final Project: Live Kaggle deployment
🙏 Design credit: This interactive lab is adapted from the Gies iMBA Learning Lab (Gies College of Business, UIUC). The three-panel layout, collapsible lab pattern, AI co-pilot concept, and Canvas simulations are innovations from that original framework. This is a respectful adaptation for the ML in Marketing curriculum at Higher Colleges of Technology — not an original creation.
Week 1 · Introduction

Why ML for Marketing?

Machine learning is not magic — it is a systematic way of finding patterns in data at a scale humans cannot match manually. As a marketer, you already possess the most valuable skill: you understand what question to ask. ML gives you the tools to answer it at scale.

The core loop: You have data about customers → ML finds the pattern → the pattern predicts future behaviour → you act on that prediction. Everything in this course is a variation on that loop.

We distinguish two types: Supervised (you provide the answer key — e.g., which customers churned) and Unsupervised (the algorithm discovers structure — e.g., natural customer segments). Marketing primarily uses supervised learning.

Key vocabulary: A feature is any known information about a customer (age, purchase history, email open rate). A label is what you want to predict (did they buy? will they churn?). Your model learns the relationship between features and labels from historical data.
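The vocabulary maps directly onto a table: each feature is a column of known information, and the label is the column you want to predict. A minimal sketch with made-up numbers (the column names here are illustrative, not from a real dataset):

```python
import pandas as pd

# Toy customer table: three features and one label
customers = pd.DataFrame({
    "age":             [34, 51, 27],        # feature
    "purchases_12m":   [5, 1, 12],          # feature
    "email_open_rate": [0.42, 0.05, 0.61],  # feature
    "churned":         [0, 1, 0],           # label — the thing we want to predict
})

X = customers[["age", "purchases_12m", "email_open_rate"]]  # features
y = customers["churned"]                                    # label
print(X.shape, y.shape)  # (3, 3) (3,)
```

Every model in this course learns a mapping from an `X` like this to a `y` like this.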

EXAMPLE 1.1 Setting Up Your Colab Environment

Python · Setup
# ── Cell 1: Install & import everything for this course ──
!pip install scikit-learn pandas matplotlib seaborn --quiet

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ Environment ready! Let's do some marketing ML.")

EXAMPLE 1.2 Loading a Real Marketing Dataset

Python · Load Data
# ── Cell 2: Load UCI Bank Marketing dataset ──
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/bank-additional.csv"
df = pd.read_csv(url, sep=";")

print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()

EXAMPLE 1.3 Your First Exploratory Data Analysis

Python · EDA
# ── Cell 3: Understand the target variable ──
print("=== Target Distribution ===")
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True).round(3))

fig, ax = plt.subplots(figsize=(8,4))
for outcome in ["no", "yes"]:
    ax.hist(df[df["y"]==outcome]["age"], bins=30, alpha=.6, label=f"Subscribed: {outcome}")
ax.set_xlabel("Age"); ax.legend(); plt.tight_layout(); plt.show()
Output
=== Target Distribution ===
no     36548
yes     4640
no     0.887
yes    0.113
Marketing insight

Only 11.3% of contacts subscribed — a class imbalance. If we predicted "no" for everyone, we'd be right 88.7% of the time! This is why Accuracy is misleading for marketing ML. We fix this in Week 3.

Try it yourself

  1. How many unique job types are in the dataset? Use df["job"].value_counts()
  2. What is the average campaign calls for subscribers vs non-subscribers? Use df.groupby("y")["campaign"].mean()
  3. Install Hugging Face datasets: !pip install datasets then from datasets import load_dataset
Week 1 Takeaway

ML does not replace marketing judgment — it amplifies it. Your job is to know what question to ask; Python's job is to find the pattern at scale.

Week 2 · sklearn Foundations

sklearn Blocks & Your First Model

The scikit-learn library follows one consistent interface: fit → predict → score. Learn these three methods once and you can use any of the 50+ models in the library. The pattern is identical whether predicting house prices or customer churn.

The fundamental principle: We train on past data to predict future data we have never seen. This is why we hold out some data for testing — testing on training data just checks whether the model memorised history, not whether it can generalise.

EXAMPLE 2.1 Train/Test Split + Linear Regression

Python · Regression
import pandas as pd; import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

np.random.seed(42); n = 500
df = pd.DataFrame({
    "ad_spend":    np.random.uniform(1000, 50000, n),
    "email_opens": np.random.randint(50, 2000, n),
    "social_posts":np.random.randint(5, 100, n),
})
df["revenue"] = (3.5*df["ad_spend"] + 12*df["email_opens"]
    + 300*df["social_posts"] + np.random.normal(0,8000,n))

X = df[["ad_spend","email_opens","social_posts"]]; y = df["revenue"]
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print(f"R² on test set: {model.score(X_test, y_test):.3f}")
print("Coefficients:", dict(zip(X.columns, model.coef_.round(2))))

EXAMPLE 2.2 Classification — Predicting Purchase Intent

Python · Classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Reload the bank dataset from Week 1 — this cell may run in a fresh notebook,
# and Example 2.1 reused the name `df` for the synthetic campaign data
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/bank-additional.csv"
bank = pd.read_csv(url, sep=";")

num_cols = ["age","campaign","pdays","previous"]
X2 = bank[num_cols].fillna(0)
y2 = (bank["y"] == "yes").astype(int)

X_tr,X_te,y_tr,y_te = train_test_split(X2, y2, test_size=.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print(classification_report(y_te, clf.predict(X_te)))
fig, ax = plt.subplots(figsize=(4,4))
ConfusionMatrixDisplay.from_estimator(clf, X_te, y_te, ax=ax)
plt.tight_layout(); plt.show()

Live Simulation — Train/Test Split Visualiser

⚡ Adjust test size and watch the split change
Regression vs Classification

Regression predicts a number (lifetime value, expected spend). Classification predicts a category (will buy / won't buy). The sklearn interface is identical — only the output and metric change.
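The "identical interface" claim is easy to verify: swapping models means changing one class name, while the fit and score calls stay the same. A self-contained sketch on synthetic regression data (the dataset here is illustrative, not the bank data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression

# Synthetic "spend → revenue" style data, just to demonstrate the shared interface
X, y = make_regression(n_samples=300, n_features=3, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=.2, random_state=0)

# Two very different models, identical three-method workflow
for model in [LinearRegression(), DecisionTreeRegressor(max_depth=4, random_state=0)]:
    model.fit(X_tr, y_tr)
    print(f"{model.__class__.__name__:24s} R² = {model.score(X_te, y_te):.3f}")
```

The loop body never changes — only the list of candidate models does.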

Try it yourself

  1. Change test_size=0.3. Does R² improve or worsen? Why?
  2. Try LogisticRegression(class_weight="balanced"). How does the recall for class 1 change?
  3. Use the model to predict revenue for a new campaign: model.predict([[20000, 500, 30]])
Week 2 Takeaway

sklearn's unified fit → predict → score interface means you can swap any model with two lines of code. Master the workflow once, then experiment freely.

Week 3 · Evaluation & Cross-Validation

Are We Actually Good?

A single train/test split is like running one A/B test in one city and calling it global truth. The result depends on which 20% you happened to hold out — different random seeds give meaningfully different scores. Cross-validation fixes this by rotating through every part of your data as the test set.

The K-Fold analogy: You want to test whether a new loyalty email works. Instead of testing only on your Dubai customers, you test on Dubai, Abu Dhabi, Sharjah, Al Ain, and RAK separately, then average. That is 5-Fold Cross-Validation. Each emirate is one "fold."
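The rotation is easy to see with ten toy customers: sklearn's KFold hands you the train/test index pairs fold by fold, and every customer appears in the test fold exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # ten toy customers, ids 0–9
kf = KFold(n_splits=5, shuffle=False)

# Each customer lands in the test fold exactly once across the five rotations
for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Fold {i}: test={test_idx.tolist()}  train={train_idx.tolist()}")
```

With shuffle=False the folds are contiguous blocks — [0,1], then [2,3], and so on — which makes the rotation obvious before you switch to StratifiedKFold for real, imbalanced data.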

EXAMPLE 3.1 Why a Single Split Lies — K-Fold Solution

Python · The variance problem
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=1)

# Same model, 20 different splits → how much do scores vary?
splits = [train_test_split(X,y,test_size=.2,random_state=s) for s in range(20)]
single_scores = [model.fit(Xtr,ytr).score(Xte,yte) for Xtr,Xte,ytr,yte in splits]
print(f"Single-split range: {min(single_scores):.3f} to {max(single_scores):.3f}")

# 5-Fold CV: stable mean ± honest uncertainty
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
print(f"5-Fold CV: {cv.mean():.3f} ± {cv.std():.3f}")
print(f"Per-fold:  {cv.round(3)}")
Output
Single-split range: 0.885 to 0.940
5-Fold CV: 0.913 ± 0.014
Per-fold:  [0.900 0.920 0.905 0.935 0.905]

EXAMPLE 3.2 Evaluation Metrics — The Full Picture

Python · Metrics + ROC Curve
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=5000, weights=[.89,.11], random_state=0)
X_tr,X_te,y_tr,y_te = train_test_split(X, y, test_size=.2, stratify=y, random_state=42)

naive = np.zeros_like(y_te)
print(f"Naive accuracy (predict all 'no'): {(naive==y_te).mean():.3f}")

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
print(f"AUC-ROC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:,1]):.3f}")

fig, ax = plt.subplots(figsize=(5,5))
RocCurveDisplay.from_estimator(clf, X_te, y_te, ax=ax)
ax.plot([0,1],[0,1],"k--",label="Random"); ax.legend(); plt.show()

EXAMPLE 3.3 Stratified K-Fold Inside a Pipeline — The Correct Way

Python · CV + Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression
import pandas as pd

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression(class_weight="balanced", max_iter=1000)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X, y, cv=skf,
                          scoring=["accuracy","f1","roc_auc"],
                          return_train_score=True)

summary = pd.DataFrame({
    m: {"mean":results[f"test_{m}"].mean(), "std":results[f"test_{m}"].std()}
    for m in ["accuracy","f1","roc_auc"]
}).T.round(3)
print(summary)
Critical rule — Data Leakage

Always run cross-validation inside a Pipeline. If you scale all data first and then CV, test folds can "see" training statistics — that is data leakage. The Pipeline refits the scaler only on each training fold automatically.

Live Simulation — K-Fold Visualiser

⚡ Adjust K to see how folds rotate through your data

Try it yourself

  1. Compare cross_val_score(..., scoring="f1") vs scoring="accuracy" on imbalanced data. Which tells the truer story?
  2. Try cv=10. Does the mean change much? Does the standard deviation go up or down?
  3. Look up TimeSeriesSplit in sklearn docs. Why would you need this for weekly campaign data?
Week 3 Takeaway

Always report CV mean ± std, not a single test score. For imbalanced marketing data, AUC-ROC and F1 tell the truth that Accuracy hides. Always keep preprocessing inside a Pipeline when cross-validating.

Week 4 · Hyperparameter Tuning

Finding the Best Settings

Every ML model ships with default settings that are "good enough" — but not optimal for your data. Hyperparameters are knobs you control before training (like max_depth or learning_rate). Tuning is systematically finding better ones.

Three strategies: Grid Search — try every combination (thorough, slow). Random Search — try random combinations (80% of the benefit at 20% of the cost). Bayesian optimisation (e.g., Optuna) — use past results to guide the search intelligently.
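Before searching, it helps to see the knobs you are turning: every sklearn model reports its hyperparameters and their current values through get_params(). A quick sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# Every sklearn estimator exposes its knobs — and their defaults — via get_params()
defaults = RandomForestClassifier().get_params()
for knob in ["n_estimators", "max_depth", "min_samples_leaf"]:
    print(f"{knob:17s} default = {defaults[knob]}")
```

Anything in that dictionary can go into a search space; the searches below simply try many values for a few of these keys.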

EXAMPLE 4.1 Grid Search & Random Search

Python · GridSearch + RandomSearch
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import time

X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr,X_te,y_tr,y_te = train_test_split(X, y, test_size=.2, random_state=42)

param_grid = {"max_depth":[3,5,8,12], "n_estimators":[50,100,200], "min_samples_leaf":[1,5,10]}

t0 = time.time()
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
gs.fit(X_tr, y_tr)
print(f"Grid Search  ⏱ {time.time()-t0:.1f}s | Best AUC: {gs.best_score_:.4f}")
print("Best:", gs.best_params_)

t0 = time.time()
rs = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid, n_iter=20, cv=5, scoring="roc_auc", random_state=42, n_jobs=-1)
rs.fit(X_tr, y_tr)
print(f"Random Search ⏱ {time.time()-t0:.1f}s | Best AUC: {rs.best_score_:.4f}")

EXAMPLE 4.2 Bayesian Tuning with Optuna

Python · Optuna
!pip install optuna --quiet
import optuna; optuna.logging.set_verbosity(optuna.logging.WARNING)
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

def objective(trial):
    m = RandomForestClassifier(
        n_estimators =trial.suggest_int("n_estimators",50,300),
        max_depth    =trial.suggest_int("max_depth",2,15),
        min_samples_leaf=trial.suggest_int("min_samples_leaf",1,20),
        random_state=42)
    return cross_val_score(m, X_tr, y_tr, cv=3, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=40, show_progress_bar=True)
print(f"Optuna best AUC: {study.best_value:.4f}")
print("Best params:", study.best_params)

Live Simulation — Parameter Search Heatmap

⚡ Simulated AUC across max_depth — highlighted row = your selection

Try it yourself

  1. Time both searches. How much faster is Random Search for similar AUC?
  2. Increase Optuna to n_trials=80. Does it keep improving or plateau?
  3. Add max_features=trial.suggest_float("max_features",0.3,1.0) to the Optuna objective.
Week 4 Takeaway

Random Search almost always beats Grid Search on quality gained per minute of compute. Use Optuna when each training run is expensive — the same logic as running only your highest-ROI campaigns once you have learned which levers matter.

Week 5 · Pipelines & Feature Engineering

Garbage In, Garbage Out

Feature engineering is where marketing domain knowledge pays off most. A Pipeline bundles data preparation with your model so that the same transformations applied during training are automatically applied to new data at prediction time.

RFM — the original marketing ML feature: Recency, Frequency, Monetary — three dimensions that have predicted customer value since the 1980s. This week you build them from scratch from raw transaction logs.

EXAMPLE 5.1 Data Leakage Without Pipeline vs. The Correct Way

Python · Leakage demo
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np

np.random.seed(42)
X = np.random.randn(200, 2000)     # 2000 pure noise features
y = np.random.randint(0, 2, 200)   # random labels → honest score ≈ 50%

# ❌ WRONG: pick the "best" features using ALL data before splitting
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)   # peeks at test labels!
X_tr,X_te,y_tr,y_te = train_test_split(X_sel, y, test_size=.2, random_state=0)
bad = LogisticRegression().fit(X_tr, y_tr)
print(f"Leaky score:   {bad.score(X_te, y_te):.3f}")       # typically well above 0.5

# ✅ RIGHT: selection inside a Pipeline — refitted on each training fold only
pipe = Pipeline([("sel",SelectKBest(f_classif, k=20)), ("clf",LogisticRegression())])
cv = cross_val_score(pipe, X, y, cv=5)
print(f"Pipeline CV:   {cv.mean():.3f} ± {cv.std():.3f}")  # ≈ 0.50 — honest

EXAMPLE 5.2 ColumnTransformer — Mixed Data Types

Python · ColumnTransformer
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

df_m = pd.DataFrame({
    "age":    [25,34,None,45,28],
    "spend":  [200,450,120,None,310],
    "channel":["email","social","email","direct","social"],
    "churned":[0,0,1,0,1],
})
num = Pipeline([("imp",SimpleImputer(strategy="median")),("sc",StandardScaler())])
cat = Pipeline([("imp",SimpleImputer(strategy="most_frequent")),("ohe",OneHotEncoder(handle_unknown="ignore"))])

prep = ColumnTransformer([("num",num,["age","spend"]),("cat",cat,["channel"])])
full = Pipeline([("prep",prep),("clf",RandomForestClassifier())])
print(full)
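The payoff of bundling preparation with the model shows up at prediction time: the fitted pipeline replays the same imputation, scaling, and encoding on new data automatically. A standalone sketch that repeats Example 5.2's pipeline and scores a hypothetical new customer (the "sms" channel and missing age are made up to stress the preprocessing):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

df_m = pd.DataFrame({
    "age":    [25,34,None,45,28],
    "spend":  [200,450,120,None,310],
    "channel":["email","social","email","direct","social"],
    "churned":[0,0,1,0,1],
})
num = Pipeline([("imp",SimpleImputer(strategy="median")),("sc",StandardScaler())])
cat = Pipeline([("imp",SimpleImputer(strategy="most_frequent")),("ohe",OneHotEncoder(handle_unknown="ignore"))])
prep = ColumnTransformer([("num",num,["age","spend"]),("cat",cat,["channel"])])
full = Pipeline([("prep",prep),("clf",RandomForestClassifier(random_state=42))])

full.fit(df_m[["age","spend","channel"]], df_m["churned"])

# A brand-new customer: missing age, never-seen channel — handled automatically
new = pd.DataFrame({"age":[None],"spend":[380],"channel":["sms"]})
print(full.predict(new))   # imputer + encoder run with the statistics fitted above
```

Note that handle_unknown="ignore" is what lets the unseen "sms" channel pass through as an all-zero encoding instead of crashing in production.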

EXAMPLE 5.3 Building RFM Features from Transaction Logs

Python · RFM Feature Engineering
import pandas as pd; import numpy as np
np.random.seed(7); n_tx = 2000

tx = pd.DataFrame({
    "customer_id": np.random.randint(1,401,n_tx),
    "date": pd.to_datetime("2025-01-01") + pd.to_timedelta(np.random.randint(0,365,n_tx),unit="D"),
    "amount": np.random.exponential(80, n_tx).round(2),
})

snap = tx["date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    Recency   = ("date",   lambda d: (snap-d.max()).days),
    Frequency = ("date",   "count"),
    Monetary  = ("amount", "sum"),
).reset_index()

for c in ["Frequency","Monetary"]:
    rfm[f"{c}_score"] = pd.qcut(rfm[c], q=5, labels=[1,2,3,4,5])
rfm["Recency_score"] = pd.qcut(rfm["Recency"], q=5, labels=[5,4,3,2,1])
rfm["RFM"] = rfm["Recency_score"].astype(int)+rfm["Frequency_score"].astype(int)+rfm["Monetary_score"].astype(int)
print(rfm.head())

Try it yourself

  1. Add avg_order_value = Monetary / Frequency. Does it improve churn prediction?
  2. Segment by RFM: Champions (≥13), At Risk (7–9), Lost (<7). How many customers in each?
  3. Add tenure (days since first purchase) as a 4th feature. Does it predict churn?
Week 5 Takeaway

A Pipeline encoding RFM features, segment flags, and seasonality almost always outperforms a raw-feature model. Feature engineering is where your marketing expertise creates an unfair advantage over pure-data approaches.

Week 6 · Tree Models & Ensemble

From One Tree to a Forest

A single decision tree is interpretable but unstable. Ensemble methods combine many trees. Bagging (Random Forest) builds trees in parallel on random subsets. Boosting (XGBoost, LightGBM) builds trees sequentially, each correcting the previous one's mistakes.

The marketing analogy: A single sales forecast from one analyst can be badly wrong. Average forecasts from 500 independently-briefed analysts (Random Forest) and you get something far more reliable. Boosting is like running the analysis, then asking a second analyst to focus only on the cases the first got wrong.
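The "500 analysts" intuition is just the statistics of averaging independent estimates: the mean of n unbiased forecasts has roughly 1/√n of the spread of a single forecast. A sketch with simulated numbers (all values made up for illustration):

```python
import numpy as np
np.random.seed(0)

true_value = 100.0
# 10,000 rounds of 500 noisy "analyst forecasts", each unbiased with std 20
forecasts = true_value + np.random.normal(0, 20, size=(10_000, 500))

one_analyst = forecasts[:, 0]          # error spread of a single forecast
committee   = forecasts.mean(axis=1)   # average of 500 forecasts per round

print(f"Single analyst std: {one_analyst.std():.2f}")   # ≈ 20
print(f"Committee std:      {committee.std():.2f}")     # ≈ 20/√500 ≈ 0.9
```

Random Forest exploits exactly this effect, with bootstrapped data and random feature subsets standing in for "independently briefed" analysts.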

EXAMPLE 6.1 Decision Tree — Interpretable but Fragile

Python · Decision Tree
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

dt = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
print(f"Train: {dt.score(X_tr,y_tr):.3f}  Test: {dt.score(X_te,y_te):.3f}")

fig, ax = plt.subplots(figsize=(18,6))
plot_tree(dt, max_depth=3, filled=True, feature_names=[f"f{i}" for i in range(X_tr.shape[1])], ax=ax)
plt.tight_layout(); plt.show()

dt_overfit = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)  # no depth limit
print(f"Overfit — Train: {dt_overfit.score(X_tr,y_tr):.3f}  Test: {dt_overfit.score(X_te,y_te):.3f}")

EXAMPLE 6.2 RF vs XGBoost vs LightGBM Head-to-Head

Python · Ensemble comparison
!pip install xgboost lightgbm --quiet
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
import time

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100,random_state=42),
    "XGBoost":       XGBClassifier(n_estimators=100,random_state=42,eval_metric="auc",verbosity=0),
    "LightGBM":      LGBMClassifier(n_estimators=100,random_state=42,verbose=-1),
}
for name, m in models.items():
    t = time.time(); m.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:,1])
    print(f"{name:15s}  AUC={auc:.4f}  Time={time.time()-t:.2f}s")

Live Simulation — OOB Error vs. Number of Trees

⚡ More trees = more stable — watch the error floor flatten out
When to use which

Random Forest: great default, robust to hyperparameters. XGBoost: Kaggle workhorse, very accurate with tuning. LightGBM: fastest on large datasets. Start with LightGBM for datasets over 100k rows.

Try it yourself

  1. Print feature importances: pd.Series(models["Random Forest"].feature_importances_).sort_values(ascending=False).head(10)
  2. Try XGBClassifier(scale_pos_weight=8) for class imbalance. Does AUC improve?
  3. Build a simple stack: use RF and XGB predictions as features for a Logistic Regression meta-model.
Week 6 Takeaway

LightGBM is your default starting point for tabular marketing data — fast, accurate, handles missing values natively. Reserve XGBoost for when you need maximum accuracy and have time to tune.

Week 7 · AutoML

Let the Machine Tune Itself

AutoML automates model selection, feature engineering, and hyperparameter tuning. It does not replace you — it handles repetitive search work so you can focus on defining the right problem and interpreting results for business action.

The critical mindset: A model with 0.94 AUC on a poorly-framed problem will still fail in production. Your marketing domain expertise defines the question. AutoML searches the answer space.

EXAMPLE 7.1 FLAML — 3 Lines to a Trained Model

Python · FLAML
!pip install flaml --quiet
from flaml import AutoML
from sklearn.metrics import roc_auc_score

automl = AutoML()
automl.fit(X_tr, y_tr, task="classification", metric="roc_auc", time_budget=60)
print(f"Best model:  {automl.best_estimator}")
print(f"Best config: {automl.best_config}")
print(f"Test AUC:    {roc_auc_score(y_te, automl.predict_proba(X_te)[:,1]):.4f}")
Free tier tip

FLAML works within Colab's free tier. Set time_budget=60 for quick experiments and time_budget=300 for production-quality results. No GPU needed for tabular marketing data.

EXAMPLE 7.2 AutoGluon — Model Leaderboard

Python · AutoGluon
# ⚠ AutoGluon is large (~1GB). Recommended: use Kaggle Notebooks (free, 30GB RAM)
!pip install autogluon.tabular --quiet
from autogluon.tabular import TabularPredictor
import pandas as pd

train_df = pd.DataFrame(X_tr); train_df["target"] = y_tr  # works for arrays and Series alike
predictor = TabularPredictor(label="target", eval_metric="roc_auc")
predictor.fit(train_df, time_limit=120, presets="medium_quality")

test_df = pd.DataFrame(X_te)
lb = predictor.leaderboard(test_df.assign(target=y_te), silent=True)
print(lb[["model","score_test","fit_time"]].head(8))

Try it yourself

  1. Change FLAML's metric to "f1". Does it select a different best model?
  2. Compare FLAML AUC vs. your best manually-tuned model from Week 4.
  3. Read automl.best_config. Can you see the hyperparameters AutoML discovered?
Week 7 Takeaway

AutoML beats a default sklearn model nearly every time, and gets you 90% of an expert's result in 5% of the time. For marketing pilots and quick proof-of-concepts, that is the right trade-off.

Week 8 · Deployment

From Notebook to Live App

A model that lives only in a Colab notebook has zero business value. Gradio turns a Python function into a web app in minutes. Hugging Face Spaces hosts it for free with a shareable link — no servers, no DevOps.

The restaurant analogy: Training a model = developing the recipe. Deployment = opening the restaurant. The best recipe in the world has no revenue until customers can actually order the dish.

EXAMPLE 8.1 Local Gradio Demo in 10 Lines

Python · Gradio demo
!pip install gradio --quiet
import gradio as gr; import joblib; import numpy as np

# Assumes `automl` is a churn model trained on four RFM-style features:
# [recency, frequency, monetary, tenure] — e.g., re-run Week 7's AutoML on your Week 5 RFM table
joblib.dump(automl, "churn_model.pkl")
model = joblib.load("churn_model.pkl")

def predict_churn(recency, frequency, monetary, tenure):
    prob = model.predict_proba(np.array([[recency,frequency,monetary,tenure]]))[0,1]
    label = "🔴 High Risk" if prob > .5 else "🟢 Low Risk"
    return {label: float(prob), "Stay": 1-float(prob)}

gr.Interface(
    fn=predict_churn,
    inputs=[gr.Slider(0,365,label="Days Since Last Purchase"),
            gr.Slider(1,50,label="Purchase Frequency"),
            gr.Slider(0,5000,label="Total Spend (AED)"),
            gr.Slider(0,1000,label="Account Age (days)")],
    outputs=gr.Label(label="Churn Probability"),
    title="🎯 Customer Churn Predictor",
).launch(share=True)

EXAMPLE 8.2 Full app.py for Hugging Face Spaces

app.py
# Upload this + model.pkl + requirements.txt to your HF Space
import gradio as gr
import joblib, pandas as pd, numpy as np

model = joblib.load("model.pkl")
FEATURES = ["recency","frequency","monetary","tenure_days"]

def predict(*args):
    df = pd.DataFrame([dict(zip(FEATURES, args))])
    prob = model.predict_proba(df)[0,1]
    risk = "🔴 High" if prob>.6 else ("🟡 Medium" if prob>.3 else "🟢 Low")
    return f"{risk} churn risk — {prob:.1%}"

gr.Interface(fn=predict,
    inputs=[gr.Number(label="Recency (days)"), gr.Number(label="Frequency"),
            gr.Number(label="Monetary (AED)"), gr.Number(label="Tenure (days)")],
    outputs="text", title="HCT ML in Marketing — Churn Predictor"
).launch()
HF Spaces deployment steps

1. Create an account at huggingface.co → 2. New Space → Gradio SDK → 3. Upload app.py, model.pkl, requirements.txt → 4. HF builds and hosts your app automatically. Shareable URL: your-name-space-name.hf.space
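The steps above mention a requirements.txt that the examples never show. A minimal sketch — package names only; consider pinning scikit-learn to the exact version that produced model.pkl, since unpickling a model across library versions can fail:

```text
gradio
scikit-learn
joblib
pandas
numpy
```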

Try it yourself

  1. Add a CSV upload with gr.File() for bulk predictions.
  2. Push your files to a GitHub repo and enable GitHub sync in HF Spaces settings.
  3. Share your Space URL with a classmate. Can they get a prediction with zero coding?
Week 8 Takeaway

Handing a non-technical stakeholder a URL and saying "just upload your data here" converts ML from a data science project into a business tool — and that conversation is what justifies the investment.

Week 9 · Full Project

Real Kaggle Dataset: End-to-End

This week you put everything together on a real-world marketing dataset. This is your course capstone: EDA → feature engineering → Pipeline → cross-validation → AutoML → deployed Gradio app.

Recommended datasets (all free on Kaggle): Telco Customer Churn · Bank Marketing Response · E-Commerce Shipping · Online Retail II. Pick the one closest to your intended industry.

EXAMPLE 9.1 Full EDA Template

Python · EDA
import pandas as pd; import seaborn as sns; import matplotlib.pyplot as plt

df = pd.read_csv("your_dataset.csv")
print("Shape:", df.shape)
print((df.isnull().mean() * 100).round(1).sort_values(ascending=False).head(10))
print(df["target"].value_counts(normalize=True))

fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm", center=0, ax=ax, annot=True, fmt=".1f")
plt.tight_layout(); plt.show()

EXAMPLE 9.2 Pipeline + FLAML + SHAP Explainability

Python · Full workflow
!pip install flaml shap --quiet
from flaml import AutoML
from sklearn.model_selection import StratifiedKFold, cross_val_score
import shap

automl = AutoML()
automl.fit(X_tr, y_tr, task="classification", metric="roc_auc", time_budget=120)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(automl.model.estimator, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {cv_auc.mean():.4f} ± {cv_auc.std():.4f}")

# TreeExplainer assumes the winning model is tree-based (LightGBM/XGBoost/RF — FLAML's usual picks);
# fall back to shap.Explainer for linear or other model families
explainer = shap.TreeExplainer(automl.model.estimator)
shap_vals = explainer.shap_values(X_te)
shap.summary_plot(shap_vals, X_te, plot_type="bar")
Deliverable checklist

① EDA notebook (≥5 visualisations) · ② ML pipeline with cross-validated AUC · ③ SHAP feature importance plot · ④ Live Gradio app on HF Spaces · ⑤ GitHub repo with README explaining the business problem

Recommended GitHub structure
ml-marketing-project/
├── data/           # README with Kaggle link only
├── notebooks/
│   ├── 01_eda.ipynb
│   └── 02_modelling.ipynb
├── app.py          # Gradio app
├── model.pkl
├── requirements.txt
└── README.md       # business problem + HF Spaces link
Week 9 Takeaway

A complete project — EDA + model + CV evaluation + deployed app + GitHub repo — is the deliverable that goes in your portfolio. It shows you can work end-to-end, not just run individual cells.

Reference · Appendix A

Cheatsheet & Debugging Guide

sklearn Quick Reference

Operation | Code | When to use
Split data | train_test_split(X,y,test_size=.2,stratify=y) | Always stratify for classification
Cross-validate | cross_val_score(pipe,X,y,cv=StratifiedKFold(5)) | For reliable metric estimation
Build pipeline | Pipeline([("sc",Scaler()),("clf",Model())]) | Any time you scale or encode
Handle missing | SimpleImputer(strategy="median") | Numeric columns with NaN
Encode categories | OneHotEncoder(handle_unknown="ignore") | Nominal categories (<20 values)
Grid search | GridSearchCV(model,params,cv=5,n_jobs=-1) | <200 total combinations
Random search | RandomizedSearchCV(...,n_iter=30) | Large parameter spaces
Save model | joblib.dump(model,"model.pkl") | Before deployment
Load model | model=joblib.load("model.pkl") | In app.py / at prediction time

Common Errors & Fixes

Error | Cause | Fix
ValueError: could not convert string | Categorical column not encoded | Add OneHotEncoder in ColumnTransformer
KeyError: "column" | Typo or wrong dataset loaded | Check df.columns.tolist()
DataConversionWarning | Mixed dtypes in array | df[col]=df[col].astype(float)
MemoryError | Dataset too large for free Colab | df=df.sample(50000) or use Kaggle Notebooks
ModuleNotFoundError | Library not installed | Run !pip install library_name
Train AUC=1.0, Test AUC=0.6 | Overfitting | Reduce max_depth, add regularisation
AUC≈0.5 after AutoML | Wrong label column or no-signal features | Verify the target column and feature relevance
Test AUC suspiciously ≈1.0 | Data leakage | Check if label-derived features are in X

Free Tools & Resources

  • Compute: Google Colab (colab.research.google.com) · Kaggle Notebooks (kaggle.com/code) — both free, no installation
  • Models & Data: Hugging Face (huggingface.co) · Kaggle Datasets (kaggle.com/datasets)
  • Deployment: Hugging Face Spaces (free Gradio SDK) · GitHub (free public repos)
  • Free APIs: HF Inference API · Cohere Trial Key · Google AI Studio (aistudio.google.com)
  • Docs: scikit-learn.org/stable · flaml.ai/docs · optuna.readthedocs.io · gradio.app/docs
Learning tip

After reading each week's material, open a blank Colab notebook and retype one example from memory — looking back only when stuck. 30 minutes of effortful recall beats 3 hours of passive reading.

Final word

You have now seen the complete modern ML stack — from raw data to a live deployed application. The tools are free. The knowledge is in your hands. The only remaining variable is practice.