Digital Marketing, Marketing Analytics & AI
A hands-on, 10-week applied course for non-programmers. Every concept starts with a marketing analogy, then becomes runnable Python code. No prior coding experience required — just business curiosity and a Google account.
- Weeks: 10 modules
- Tools: Colab · HF · Gradio · GitHub
- AI Co-Pilot: Student & Manager modes
- Final Project: Live Kaggle deployment
Why ML for Marketing?
Machine learning is not magic — it is a systematic way of finding patterns in data at a scale humans cannot match manually. As a marketer, you already possess the most valuable skill: you understand what question to ask. ML gives you the tools to answer it at scale.
The core loop: You have data about customers → ML finds the pattern → the pattern predicts future behaviour → you act on that prediction. Everything in this course is a variation on that loop.
We distinguish two types: Supervised (you provide the answer key — e.g., which customers churned) and Unsupervised (the algorithm discovers structure — e.g., natural customer segments). Marketing primarily uses supervised learning.
Key vocabulary: A feature is any known information about a customer (age, purchase history, email open rate). A label is what you want to predict (did they buy? will they churn?). Your model learns the relationship between features and labels from historical data.
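To make the vocabulary concrete, here is a minimal sketch of features vs. labels on a toy customer table (all numbers and column names are made up for illustration):

import pandas as pd
customers = pd.DataFrame({
    "age":             [34, 51, 27],        # feature
    "email_open_rate": [0.42, 0.10, 0.75],  # feature
    "purchases_90d":   [3, 0, 7],           # feature
    "churned":         [0, 1, 0],           # label — the thing we want to predict
})
X = customers.drop(columns="churned")  # features go in X
y = customers["churned"]               # label goes in y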
EXAMPLE 1.1 Setting Up Your Colab Environment
# ── Cell 1: Install & import everything for this course ──
!pip install scikit-learn pandas matplotlib seaborn --quiet
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print("✅ Environment ready! Let's do some marketing ML.")
EXAMPLE 1.2 Loading a Real Marketing Dataset
# ── Cell 2: Load UCI Bank Marketing dataset ──
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/bank-additional.csv"
df = pd.read_csv(url, sep=";")
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
df.head()
EXAMPLE 1.3 Your First Exploratory Data Analysis
# ── Cell 3: Understand the target variable ──
print("=== Target Distribution ===")
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True).round(3))
fig, ax = plt.subplots(figsize=(8,4))
for outcome in ["no", "yes"]:
    ax.hist(df[df["y"]==outcome]["age"], bins=30, alpha=.6, label=f"Subscribed: {outcome}")
ax.set_xlabel("Age"); ax.legend(); plt.tight_layout(); plt.show()
Only 11.3% of contacts subscribed — a class imbalance. If we predicted "no" for everyone, we'd be right 88.7% of the time! This is why Accuracy is misleading for marketing ML. We fix this in Week 3.
Try it yourself
- How many unique job types are in the dataset? Use df["job"].value_counts()
- What is the average number of campaign calls for subscribers vs non-subscribers? Use df.groupby("y")["campaign"].mean()
- Install Hugging Face datasets with !pip install datasets, then from datasets import load_dataset — a sketch follows this list.
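One possible sketch for the last exercise — the "imdb" dataset name is just an illustration; any public dataset on the Hugging Face Hub loads the same way:

!pip install datasets --quiet
from datasets import load_dataset
ds = load_dataset("imdb", split="train")  # downloads once, then caches locally
print(ds)      # number of rows and column names
print(ds[0])   # first record as a plain Python dict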
ML does not replace marketing judgment — it amplifies it. Your job is to know what question to ask; Python's job is to find the pattern at scale.
sklearn Blocks & Your First Model
The scikit-learn library follows one consistent interface: fit → predict → score. Learn these three methods once and you can use any of the 50+ models in the library. The pattern is identical whether predicting house prices or customer churn.
The fundamental principle: We train on past data to predict future data we have never seen. This is why we hold out some data for testing — testing on training data just checks whether the model memorised history, not whether it can generalise.
EXAMPLE 2.1 Train/Test Split + Linear Regression
import pandas as pd; import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
np.random.seed(42); n = 500
df = pd.DataFrame({
    "ad_spend":     np.random.uniform(1000, 50000, n),
    "email_opens":  np.random.randint(50, 2000, n),
    "social_posts": np.random.randint(5, 100, n),
})
df["revenue"] = (3.5*df["ad_spend"] + 12*df["email_opens"]
                 + 300*df["social_posts"] + np.random.normal(0, 8000, n))
X = df[["ad_spend","email_opens","social_posts"]]; y = df["revenue"]
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(f"R² on test set: {model.score(X_test, y_test):.3f}")
print("Coefficients:", dict(zip(X.columns, model.coef_.round(2))))
EXAMPLE 2.2 Classification — Predicting Purchase Intent
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# Numeric columns from the bank dataset loaded in Week 1
num_cols = ["age","campaign","pdays","previous"]
X2 = df[num_cols].fillna(0)
y2 = (df["y"] == "yes").astype(int)
X_tr,X_te,y_tr,y_te = train_test_split(X2, y2, test_size=.2, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
fig, ax = plt.subplots(figsize=(4,4))
ConfusionMatrixDisplay.from_estimator(clf, X_te, y_te, ax=ax)
plt.tight_layout(); plt.show()
Live Simulation — Train/Test Split Visualiser
Regression predicts a number (lifetime value, expected spend). Classification predicts a category (will buy / won't buy). The sklearn interface is identical — only the output and metric change.
Try it yourself
- Change test_size=0.3. Does R² improve or worsen? Why?
- Try LogisticRegression(class_weight="balanced"). How does the recall for class 1 change?
- Use the model to predict revenue for a new campaign: model.predict([[20000, 500, 30]])
sklearn's unified fit → predict → score interface means you can swap any model with two lines of code. Master the workflow once, then experiment freely.
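To see that two-line swap in action, here is a short sketch that reuses X_train, X_test, y_train, y_test from Example 2.1 and changes only the import and the model line (the choice of RandomForestRegressor is just one example):

from sklearn.ensemble import RandomForestRegressor                    # line 1: new import
model = RandomForestRegressor(random_state=42).fit(X_train, y_train)  # line 2: new model
print(f"R² on test set: {model.score(X_test, y_test):.3f}")           # everything else unchanged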
Are We Actually Good?
A single train/test split is like running one A/B test in one city and calling it global truth. The result depends on which 20% you happened to hold out — different random seeds give meaningfully different scores. Cross-validation fixes this by rotating through every part of your data as the test set.
The K-Fold analogy: You want to test whether a new loyalty email works. Instead of testing only on your Dubai customers, you test on Dubai, Abu Dhabi, Sharjah, Al Ain, and RAK separately, then average. That is 5-Fold Cross-Validation. Each emirate is one "fold."
EXAMPLE 3.1 Why a Single Split Lies — K-Fold Solution
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=1)
# Same model, 20 different splits → how much do scores vary?
splits = [train_test_split(X, y, test_size=.2, random_state=s) for s in range(20)]
single_scores = [model.fit(Xtr, ytr).score(Xte, yte) for Xtr, Xte, ytr, yte in splits]
print(f"Single-split range: {min(single_scores):.3f} to {max(single_scores):.3f}")
# 5-Fold CV: stable mean ± honest uncertainty
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv = cross_val_score(model, X, y, cv=skf, scoring="accuracy")
print(f"5-Fold CV: {cv.mean():.3f} ± {cv.std():.3f}")
print(f"Per-fold: {cv.round(3)}")
EXAMPLE 3.2 Evaluation Metrics — The Full Picture
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt
X, y = make_classification(n_samples=5000, weights=[.89,.11], random_state=0)
X_tr,X_te,y_tr,y_te = train_test_split(X, y, test_size=.2, stratify=y, random_state=42)
naive = np.zeros_like(y_te)
print(f"Naive accuracy (predict all 'no'): {(naive==y_te).mean():.3f}")
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
print(f"AUC-ROC: {roc_auc_score(y_te, clf.predict_proba(X_te)[:,1]):.3f}")
fig, ax = plt.subplots(figsize=(5,5))
RocCurveDisplay.from_estimator(clf, X_te, y_te, ax=ax)
ax.plot([0,1],[0,1],"k--",label="Random"); ax.legend(); plt.show()
EXAMPLE 3.3 Stratified K-Fold Inside a Pipeline — The Correct Way
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate, StratifiedKFold
import pandas as pd
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(pipe, X, y, cv=skf,
                         scoring=["accuracy","f1","roc_auc"],
                         return_train_score=True)
summary = pd.DataFrame({
    m: {"mean": results[f"test_{m}"].mean(), "std": results[f"test_{m}"].std()}
    for m in ["accuracy","f1","roc_auc"]
}).T.round(3)
print(summary)
Always run cross-validation inside a Pipeline. If you scale all data first and then CV, test folds can "see" training statistics — that is data leakage. The Pipeline refits the scaler only on each training fold automatically.
Live Simulation — K-Fold Visualiser
Try it yourself
- Compare cross_val_score(..., scoring="f1") vs scoring="accuracy" on imbalanced data. Which tells the truer story?
- Try cv=10. Does the mean change much? Does the standard deviation go up or down?
- Look up TimeSeriesSplit in the sklearn docs. Why would you need it for weekly campaign data? A sketch follows this list.
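A short sketch of TimeSeriesSplit on synthetic weekly data (the 52-week array is an assumption for illustration, not the course dataset) — note how every fold trains only on weeks that come before the ones it tests, so you never train on the future to predict the past:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit
weeks = np.arange(52).reshape(-1, 1)  # 52 weeks of campaign history
for i, (tr, te) in enumerate(TimeSeriesSplit(n_splits=4).split(weeks)):
    print(f"Fold {i}: train weeks {tr.min()}-{tr.max()}, test weeks {te.min()}-{te.max()}")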
Always report CV mean ± std, not a single test score. For imbalanced marketing data, AUC-ROC and F1 tell the truth that Accuracy hides. Always keep preprocessing inside a Pipeline when cross-validating.
Finding the Best Settings
Every ML model ships with default settings that are "good enough" — but not optimal for your data. Hyperparameters are knobs you control before training (like max_depth or learning_rate). Tuning is systematically finding better ones.
Three strategies: Grid Search — try every combination (thorough but slow). Random Search — try random combinations (roughly 80% of the benefit at 20% of the cost). Bayesian optimisation with Optuna — use past results to guide the search intelligently.
EXAMPLE 4.1 Grid Search & Random Search
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import time
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)
X_tr,X_te,y_tr,y_te = train_test_split(X, y, test_size=.2, random_state=42)
param_grid = {"max_depth":[3,5,8,12], "n_estimators":[50,100,200], "min_samples_leaf":[1,5,10]}
t0 = time.time()
gs = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
gs.fit(X_tr, y_tr)
print(f"Grid Search ⏱ {time.time()-t0:.1f}s | Best AUC: {gs.best_score_:.4f}")
print("Best:", gs.best_params_)
t0 = time.time()
rs = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_grid, n_iter=20, cv=5, scoring="roc_auc", random_state=42, n_jobs=-1)
rs.fit(X_tr, y_tr)
print(f"Random Search ⏱ {time.time()-t0:.1f}s | Best AUC: {rs.best_score_:.4f}")
EXAMPLE 4.2 Bayesian Tuning with Optuna
!pip install optuna --quiet
import optuna
from sklearn.model_selection import cross_val_score
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
    m = RandomForestClassifier(
        n_estimators     = trial.suggest_int("n_estimators", 50, 300),
        max_depth        = trial.suggest_int("max_depth", 2, 15),
        min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 20),
        random_state=42)
    return cross_val_score(m, X_tr, y_tr, cv=3, scoring="roc_auc").mean()
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=40, show_progress_bar=True)
print(f"Optuna best AUC: {study.best_value:.4f}")
print("Best params:", study.best_params)
Live Simulation — Parameter Search Heatmap
Try it yourself
- Time both searches. How much faster is Random Search for similar AUC?
- Increase Optuna to n_trials=80. Does it keep improving or plateau?
- Add max_features=trial.suggest_float("max_features", 0.3, 1.0) to the Optuna objective — a sketch follows this list.
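One way the last exercise could look — the same objective from Example 4.2 with the extra knob added (the search ranges come from the exercise; everything else is unchanged):

def objective(trial):
    m = RandomForestClassifier(
        n_estimators     = trial.suggest_int("n_estimators", 50, 300),
        max_depth        = trial.suggest_int("max_depth", 2, 15),
        min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 20),
        max_features     = trial.suggest_float("max_features", 0.3, 1.0),  # new search dimension
        random_state=42)
    return cross_val_score(m, X_tr, y_tr, cv=3, scoring="roc_auc").mean()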
Random Search beats Grid Search in time-per-quality almost always. Use Optuna when each training run is expensive — it is the same logic as running only your highest-ROI campaigns once you have learned which levers matter.
Garbage In, Garbage Out
Feature engineering is where marketing domain knowledge pays off most. A Pipeline bundles data preparation with your model so that the same transformations applied during training are automatically applied to new data at prediction time.
RFM — the original marketing ML feature: Recency, Frequency, Monetary — three dimensions that have predicted customer value since the 1980s. This week you build them from scratch from raw transaction logs.
EXAMPLE 5.1 Data Leakage Without Pipeline vs. The Correct Way
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np
np.random.seed(42)
X = np.random.randn(1000, 20) # 20 pure noise features
y = np.random.randint(0, 2, 1000) # random labels → model should score ~50%
# ❌ WRONG: scale ALL data before splitting
X_sc = StandardScaler().fit_transform(X) # leaks train stats into test!
X_tr,X_te,y_tr,y_te = train_test_split(X_sc, y, test_size=.2)
bad = LogisticRegression().fit(X_tr, y_tr)
print(f"Leaky score: {bad.score(X_te, y_te):.3f}")
# ✅ RIGHT: scaler inside Pipeline — refitted per fold
pipe = Pipeline([("sc",StandardScaler()), ("clf",LogisticRegression())])
cv = cross_val_score(pipe, X, y, cv=5)
print(f"Pipeline CV: {cv.mean():.3f} ± {cv.std():.3f}") # ~0.50 — honest
EXAMPLE 5.2 ColumnTransformer — Mixed Data Types
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
df_m = pd.DataFrame({
    "age":     [25, 34, None, 45, 28],
    "spend":   [200, 450, 120, None, 310],
    "channel": ["email","social","email","direct","social"],
    "churned": [0, 0, 1, 0, 1],
})
num = Pipeline([("imp",SimpleImputer(strategy="median")),("sc",StandardScaler())])
cat = Pipeline([("imp",SimpleImputer(strategy="most_frequent")),("ohe",OneHotEncoder(handle_unknown="ignore"))])
prep = ColumnTransformer([("num",num,["age","spend"]),("cat",cat,["channel"])])
full = Pipeline([("prep",prep),("clf",RandomForestClassifier())])
print(full)
EXAMPLE 5.3 Building RFM Features from Transaction Logs
import pandas as pd; import numpy as np
np.random.seed(7); n_tx = 2000
tx = pd.DataFrame({
    "customer_id": np.random.randint(1, 401, n_tx),
    "date": pd.to_datetime("2025-01-01") + pd.to_timedelta(np.random.randint(0, 365, n_tx), unit="D"),
    "amount": np.random.exponential(80, n_tx).round(2),
})
snap = tx["date"].max() + pd.Timedelta(days=1)
rfm = tx.groupby("customer_id").agg(
    Recency   = ("date", lambda d: (snap - d.max()).days),
    Frequency = ("date", "count"),
    Monetary  = ("amount", "sum"),
).reset_index()
for c in ["Frequency", "Monetary"]:
    # rank first so qcut never fails on duplicate bin edges when counts are tied
    rfm[f"{c}_score"] = pd.qcut(rfm[c].rank(method="first"), q=5, labels=[1,2,3,4,5])
rfm["Recency_score"] = pd.qcut(rfm["Recency"].rank(method="first"), q=5, labels=[5,4,3,2,1])
rfm["RFM"] = rfm["Recency_score"].astype(int) + rfm["Frequency_score"].astype(int) + rfm["Monetary_score"].astype(int)
print(rfm.head())
Try it yourself
- Add avg_order_value = Monetary / Frequency as a new feature. Does it improve churn prediction?
- Segment by RFM score: Champions (≥13), At Risk (7–9), Lost (<7). How many customers fall in each? A sketch follows this list.
- Add tenure (days since first purchase) as a fourth feature. Does it predict churn?
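A sketch of the segmentation exercise using the thresholds listed above. The exercise names no tier for scores 10–12, so the "Promising" label below is an assumption — adjust to your own rules:

def segment(score):
    if score >= 13: return "Champions"
    if score >= 10: return "Promising"   # assumed label for the unnamed 10-12 band
    if score >= 7:  return "At Risk"
    return "Lost"
rfm["segment"] = rfm["RFM"].apply(segment)
print(rfm["segment"].value_counts())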
A Pipeline encoding RFM features, segment flags, and seasonality almost always outperforms a raw-feature model. Feature engineering is where your marketing expertise creates an unfair advantage over pure-data approaches.
From One Tree to a Forest
A single decision tree is interpretable but unstable. Ensemble methods combine many trees. Bagging (Random Forest) builds trees in parallel on random subsets. Boosting (XGBoost, LightGBM) builds trees sequentially, each correcting the previous one's mistakes.
The marketing analogy: A single sales forecast from one analyst can be badly wrong. Average forecasts from 500 independently-briefed analysts (Random Forest) and you get something far more reliable. Boosting is like running the analysis, then asking a second analyst to focus only on the cases the first got wrong.
EXAMPLE 6.1 Decision Tree — Interpretable but Fragile
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt
# Reuses X_tr, X_te, y_tr, y_te from Week 4's split
dt = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_tr, y_tr)
print(f"Train: {dt.score(X_tr,y_tr):.3f} Test: {dt.score(X_te,y_te):.3f}")
fig, ax = plt.subplots(figsize=(18,6))
plot_tree(dt, max_depth=3, filled=True, feature_names=[f"f{i}" for i in range(X_tr.shape[1])], ax=ax)
plt.tight_layout(); plt.show()
dt_overfit = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr) # no depth limit
print(f"Overfit — Train: {dt_overfit.score(X_tr,y_tr):.3f} Test: {dt_overfit.score(X_te,y_te):.3f}")
EXAMPLE 6.2 RF vs XGBoost vs LightGBM Head-to-Head
!pip install xgboost lightgbm --quiet
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
import time
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost":  XGBClassifier(n_estimators=100, random_state=42, eval_metric="auc", verbosity=0),
    "LightGBM": LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),
}
for name, m in models.items():
    t = time.time()
    m.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:,1])
    print(f"{name:15s} AUC={auc:.4f} Time={time.time()-t:.2f}s")
Live Simulation — OOB Error vs. Number of Trees
Random Forest: great default, robust to hyperparameters. XGBoost: Kaggle workhorse, very accurate with tuning. LightGBM: fastest on large datasets. Start with LightGBM for datasets over 100k rows.
Try it yourself
- Print feature importances: pd.Series(models["Random Forest"].feature_importances_).sort_values(ascending=False).head(10)
- Try XGBClassifier(scale_pos_weight=8) for class imbalance. Does AUC improve?
- Build a simple stack: use RF and XGB predictions as features for a Logistic Regression meta-model — a sketch follows this list.
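One possible sketch of the stacking exercise, using sklearn's built-in StackingClassifier instead of hand-rolled out-of-fold predictions — a simplification of the exercise, not the only valid design. It reuses X_tr, X_te and the model classes imported in Example 6.2:

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
stack = StackingClassifier(
    estimators=[("rf",  RandomForestClassifier(n_estimators=100, random_state=42)),
                ("xgb", XGBClassifier(n_estimators=100, random_state=42, verbosity=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-model over base predictions
    cv=5,  # base models are scored out-of-fold before the meta-model sees them
)
stack.fit(X_tr, y_tr)
print(f"Stacked test accuracy: {stack.score(X_te, y_te):.3f}")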
LightGBM is your default starting point for tabular marketing data — fast, accurate, handles missing values natively. Reserve XGBoost for when you need maximum accuracy and have time to tune.
Let the Machine Tune Itself
AutoML automates model selection, feature engineering, and hyperparameter tuning. It does not replace you — it handles repetitive search work so you can focus on defining the right problem and interpreting results for business action.
The critical mindset: A model with 0.94 AUC on a poorly-framed problem will still fail in production. Your marketing domain expertise defines the question. AutoML searches the answer space.
EXAMPLE 7.1 FLAML — 3 Lines to a Trained Model
!pip install flaml --quiet
from flaml import AutoML
from sklearn.metrics import roc_auc_score
automl = AutoML()
automl.fit(X_tr, y_tr, task="classification", metric="roc_auc", time_budget=60)
print(f"Best model: {automl.best_estimator}")
print(f"Best config: {automl.best_config}")
print(f"Test AUC: {roc_auc_score(y_te, automl.predict_proba(X_te)[:,1]):.4f}")
FLAML works within Colab's free tier. Set time_budget=60 for quick experiments and time_budget=300 for production-quality results. No GPU needed for tabular marketing data.
EXAMPLE 7.2 AutoGluon — Model Leaderboard
# ⚠ AutoGluon is large (~1GB). Recommended: use Kaggle Notebooks (free, 30GB RAM)
!pip install autogluon.tabular --quiet
from autogluon.tabular import TabularPredictor
import pandas as pd; import numpy as np
train_df = pd.DataFrame(X_tr); train_df["target"] = np.asarray(y_tr)  # works for Series or ndarray
predictor = TabularPredictor(label="target", eval_metric="roc_auc")
predictor.fit(train_df, time_limit=120, presets="medium_quality")
test_df = pd.DataFrame(X_te)
lb = predictor.leaderboard(test_df.assign(target=y_te), silent=True)
print(lb[["model","score_test","fit_time"]].head(8))
Try it yourself
- Change FLAML's metric to "f1". Does it select a different best model?
- Compare FLAML's AUC vs. your best manually-tuned model from Week 4.
- Read automl.best_config. Can you see the hyperparameters AutoML discovered?
AutoML beats a default sklearn model nearly every time, and gets you 90% of an expert's result in 5% of the time. For marketing pilots and quick proof-of-concepts, that is the right trade-off.
From Notebook to Live App
A model that lives only in a Colab notebook has zero business value. Gradio turns a Python function into a web app in minutes. Hugging Face Spaces hosts it for free with a shareable link — no servers, no DevOps.
The restaurant analogy: Training a model = developing the recipe. Deployment = opening the restaurant. The best recipe in the world has no revenue until customers can actually order the dish.
EXAMPLE 8.1 Local Gradio Demo in 10 Lines
!pip install gradio --quiet
import gradio as gr; import joblib; import numpy as np
joblib.dump(automl, "churn_model.pkl")  # assumes a model trained on these four RFM-style features
model = joblib.load("churn_model.pkl")
def predict_churn(recency, frequency, monetary, tenure):
    prob = model.predict_proba(np.array([[recency, frequency, monetary, tenure]]))[0, 1]
    label = "🔴 High Risk" if prob > .5 else "🟢 Low Risk"
    return {label: float(prob), "Stay": 1 - float(prob)}
gr.Interface(
    fn=predict_churn,
    inputs=[gr.Slider(0, 365, label="Days Since Last Purchase"),
            gr.Slider(1, 50, label="Purchase Frequency"),
            gr.Slider(0, 5000, label="Total Spend (AED)"),
            gr.Slider(0, 1000, label="Account Age (days)")],
    outputs=gr.Label(label="Churn Probability"),
    title="🎯 Customer Churn Predictor",
).launch(share=True)
EXAMPLE 8.2 Full app.py for Hugging Face Spaces
# Upload this + model.pkl + requirements.txt to your HF Space
import gradio as gr
import joblib, pandas as pd, numpy as np
model = joblib.load("model.pkl")
FEATURES = ["recency","frequency","monetary","tenure_days"]
def predict(*args):
    df = pd.DataFrame([dict(zip(FEATURES, args))])
    prob = model.predict_proba(df)[0, 1]
    risk = "🔴 High" if prob > .6 else ("🟡 Medium" if prob > .3 else "🟢 Low")
    return f"{risk} churn risk — {prob:.1%}"
gr.Interface(fn=predict,
    inputs=[gr.Number(label="Recency (days)"), gr.Number(label="Frequency"),
            gr.Number(label="Monetary (AED)"), gr.Number(label="Tenure (days)")],
    outputs="text", title="HCT ML in Marketing — Churn Predictor"
).launch()
1. Create an account at huggingface.co
2. New Space → Gradio SDK
3. Upload app.py, model.pkl, requirements.txt
4. HF builds and hosts your app automatically — shareable URL: your-name-space-name.hf.space
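A minimal requirements.txt sketch for this Space. The package list mirrors what app.py imports; leaving versions unpinned is an assumption — pin them if the build breaks:

scikit-learn
pandas
numpy
joblib
# gradio itself is supplied by the Gradio SDK runtime, so listing it is optional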
Try it yourself
- Add a CSV upload with gr.File() for bulk predictions — a sketch follows this list.
- Push your files to a GitHub repo and enable GitHub sync in your HF Space's settings.
- Share your Space URL with a classmate. Can they get a prediction with zero coding?
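A hedged sketch of the CSV-upload exercise. It assumes the uploaded file contains exactly the FEATURES columns from app.py and does no validation; the getattr call covers both the older file-object and newer filepath-string behaviours of gr.File:

import gradio as gr
import pandas as pd
def predict_csv(file):
    path = getattr(file, "name", file)   # file-object or filepath string, depending on Gradio version
    df = pd.read_csv(path)
    df["churn_prob"] = model.predict_proba(df[FEATURES])[:, 1]
    return df
gr.Interface(fn=predict_csv, inputs=gr.File(label="Customer CSV"),
             outputs=gr.Dataframe(label="Scored customers")).launch()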
Handing a non-technical stakeholder a URL and saying "just upload your data here" converts ML from a data science project into a business tool — and that conversation is what justifies the investment.
Real Kaggle Dataset: End-to-End
This week you put everything together on a real-world marketing dataset. This is your course capstone: EDA → feature engineering → Pipeline → cross-validation → AutoML → deployed Gradio app.
Recommended datasets (all free on Kaggle): Telco Customer Churn · Bank Marketing Response · E-Commerce Shipping · Online Retail II. Pick the one closest to your intended industry.
EXAMPLE 9.1 Full EDA Template
import pandas as pd; import seaborn as sns; import matplotlib.pyplot as plt
df = pd.read_csv("your_dataset.csv")
print("Shape:", df.shape)
print((df.isnull().mean() * 100).round(1).sort_values(ascending=False).head(10))
print(df["target"].value_counts(normalize=True))
fig, ax = plt.subplots(figsize=(10,7))
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm", center=0, ax=ax, annot=True, fmt=".1f")
plt.tight_layout(); plt.show()
EXAMPLE 9.2 Pipeline + FLAML + SHAP Explainability
!pip install flaml shap --quiet
from flaml import AutoML
from sklearn.model_selection import StratifiedKFold, cross_val_score
import shap
automl = AutoML()
automl.fit(X_tr, y_tr, task="classification", metric="roc_auc", time_budget=120)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = cross_val_score(automl.model.estimator, X, y, cv=cv, scoring="roc_auc")
print(f"CV AUC: {cv_auc.mean():.4f} ± {cv_auc.std():.4f}")
explainer = shap.TreeExplainer(automl.model.estimator)  # assumes FLAML picked a tree-based model
shap_vals = explainer.shap_values(X_te)
shap.summary_plot(shap_vals, X_te, plot_type="bar")
Deliverables: ① EDA notebook (≥5 visualisations) · ② ML pipeline with cross-validated AUC · ③ SHAP feature importance plot · ④ Live Gradio app on HF Spaces · ⑤ GitHub repo with README explaining the business problem
ml-marketing-project/
├── data/ # README with Kaggle link only
├── notebooks/
│ ├── 01_eda.ipynb
│ └── 02_modelling.ipynb
├── app.py # Gradio app
├── model.pkl
├── requirements.txt
└── README.md # business problem + HF Spaces link
A complete project — EDA + model + CV evaluation + deployed app + GitHub repo — is the deliverable that goes in your portfolio. It shows you can work end-to-end, not just run individual cells.
Cheatsheet & Debugging Guide
sklearn Quick Reference
| Operation | Code | When to use |
|---|---|---|
| Split data | train_test_split(X,y,test_size=.2,stratify=y) | Always stratify for classification |
| Cross-validate | cross_val_score(pipe,X,y,cv=StratifiedKFold(5)) | For reliable metric estimation |
| Build pipeline | Pipeline([("sc",Scaler()),("clf",Model())]) | Any time you scale or encode |
| Handle missing | SimpleImputer(strategy="median") | Numeric columns with NaN |
| Encode categories | OneHotEncoder(handle_unknown="ignore") | Nominal categories (<20 values) |
| Grid search | GridSearchCV(model,params,cv=5,n_jobs=-1) | <200 total combinations |
| Random search | RandomizedSearchCV(...,n_iter=30) | Large parameter spaces |
| Save model | joblib.dump(model,"model.pkl") | Before deployment |
| Load model | model=joblib.load("model.pkl") | In app.py / at prediction time |
Common Errors & Fixes
| Error | Cause | Fix |
|---|---|---|
| ValueError: could not convert string | Categorical column not encoded | Add OneHotEncoder in a ColumnTransformer |
| KeyError: "column" | Typo or wrong dataset loaded | Check df.columns.tolist() |
| DataConversionWarning | Mixed dtypes in array | df[col] = df[col].astype(float) |
| MemoryError | Dataset too large for free Colab | df = df.sample(50000) or use Kaggle Notebooks |
| ModuleNotFoundError | Library not installed | Run !pip install library_name |
| Train AUC=1.0, Test AUC=0.6 | Overfitting | Reduce max_depth, add regularisation |
| AUC≈0.5 after AutoML | Data leakage or wrong label | Check if label-derived features are in X |
Free Tools & Resources
- Compute: Google Colab (colab.research.google.com) · Kaggle Notebooks (kaggle.com/code) — both free, no installation
- Models & Data: Hugging Face (huggingface.co) · Kaggle Datasets (kaggle.com/datasets)
- Deployment: Hugging Face Spaces (free Gradio SDK) · GitHub (free public repos)
- Free APIs: HF Inference API · Cohere Trial Key · Google AI Studio (aistudio.google.com)
- Docs: scikit-learn.org/stable · flaml.ai/docs · optuna.readthedocs.io · gradio.app/docs
After reading each week's material, open a blank Colab notebook and retype one example from memory — looking back only when stuck. 30 minutes of effortful recall beats 3 hours of passive reading.
You have now seen the complete modern ML stack — from raw data to a live deployed application. The tools are free. The knowledge is in your hands. The only remaining variable is practice.