Build the data pipelines that feed production AI systems. Over four weeks, master SQL-based ETL, unstructured PDF parsing, multimodal semantic chunking, and unsupervised data clustering, ending with a multimodal knowledge ingestor.
4 weeks
What You'll Learn
Build orchestrated ETL pipelines using Dagster with incremental loading
Parse and extract structured data from PDFs using PyMuPDF and unstructured.io
Implement fixed, recursive, semantic, and agentic chunking strategies
Reduce high-dimensional data with UMAP and cluster it with HDBSCAN
Benchmark pipeline quality using CCT (Character, Chunk, Token) metrics
Course Content
W1
Week 1: Traditional ETL & Structured Data
Build reliable, orchestrated pipelines for structured data sources.
1
SQL Fundamentals & Aggregations
Writing production SQL (joins, window functions, CTEs) to extract and aggregate data from relational sources.
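A runnable illustration of two of these constructs, a CTE and a window function, using Python's built-in sqlite3. The table and data are invented for the example:

```python
import sqlite3

# Illustrative schema; table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'acme', 100.0), (2, 'acme', 50.0), (3, 'globex', 75.0);
""")

# CTE + window function: each order with its customer's running total.
rows = conn.execute("""
    WITH ranked AS (
        SELECT id, customer, amount,
               SUM(amount) OVER (
                   PARTITION BY customer ORDER BY id
               ) AS running_total
        FROM orders
    )
    SELECT customer, amount, running_total FROM ranked
""").fetchall()

for customer, amount, total in rows:
    print(customer, amount, total)
```

Note that window functions require SQLite 3.25+, which ships with modern Python builds.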
2
ETL Architecture & Data Contracts
Designing ETL pipelines with explicit data contracts that catch schema drift before it breaks downstream models.
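A minimal sketch of a data contract check in plain Python; the field names and types are invented for illustration:

```python
# Hypothetical data contract: expected column names mapped to Python types.
CONTRACT = {"user_id": int, "email": str, "signup_ts": str}

def validate(record: dict, contract: dict = CONTRACT) -> list:
    """Return a list of contract violations (empty list = clean record)."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    # Unknown fields often signal upstream schema drift.
    for field in record.keys() - contract.keys():
        errors.append(f"unexpected field: {field}")
    return errors

print(validate({"user_id": 1, "email": "a@b.co", "signup_ts": "2024-01-01"}))
print(validate({"user_id": "1", "email": "a@b.co", "plan": "pro"}))
```

Running a check like this at pipeline boundaries surfaces drift as an explicit error rather than a silent downstream failure.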
3
Pipeline Orchestration with Dagster
Defining assets and jobs in Dagster to schedule, monitor, and retry ETL pipelines with full observability.
4
Incremental Loading & API Limits
Implementing cursor-based incremental loads to sync only new records while respecting API rate limits.
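The cursor pattern can be sketched against a fake paginated API; all names and the rate-limit strategy here are simplified assumptions:

```python
import time

# Fake paginated source: returns records with ids > cursor, `limit` per call.
DATA = [{"id": i, "value": f"row-{i}"} for i in range(1, 26)]

def fetch_page(cursor: int, limit: int = 10):
    return [r for r in DATA if r["id"] > cursor][:limit]

def incremental_sync(cursor: int, min_interval: float = 0.0):
    """Pull only records newer than `cursor`, with simple call spacing."""
    synced = []
    while True:
        page = fetch_page(cursor)
        if not page:
            break
        synced.extend(page)
        cursor = page[-1]["id"]        # advance cursor to last record seen
        time.sleep(min_interval)       # crude rate-limit spacing between calls
    return synced, cursor

rows, cursor = incremental_sync(cursor=20)
print(len(rows), cursor)  # 5 new records, cursor now 25
```

Persisting the returned cursor between runs is what makes the next sync pick up exactly where this one stopped.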
Weekly Win
External Data Integration
Build a Dagster pipeline that incrementally loads data from an external API into a structured store with schema validation at every step.
W2
Week 2: Parsing Unstructured Text & PDFs
Extract clean, structured data from messy real-world documents.
1
Document Layout Analysis
Detecting bounding boxes and understanding page layout to correctly separate headers, body text, tables, and figures.
2
Slicing PDFs with PyMuPDF
Dynamically slicing PDF documents based on table-of-contents metadata to extract section-level content programmatically.
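PyMuPDF's `doc.get_toc()` returns entries shaped like `[level, title, start_page]` with 1-based pages. A stdlib sketch of turning top-level entries into page ranges (the sample TOC is invented):

```python
def toc_to_ranges(toc, page_count):
    """Map each top-level TOC entry to an inclusive (start, end) page range."""
    tops = [(title, page) for level, title, page in toc if level == 1]
    ranges = {}
    for i, (title, start) in enumerate(tops):
        # A section ends where the next top-level section begins.
        end = tops[i + 1][1] - 1 if i + 1 < len(tops) else page_count
        ranges[title] = (start, end)
    return ranges

toc = [[1, "Intro", 1], [2, "Background", 2], [1, "Methods", 4], [1, "Results", 9]]
print(toc_to_ranges(toc, page_count=12))
```

With the ranges in hand, a real pipeline would copy each page span into its own document or extract its text section by section.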
3
Extraction with unstructured.io
Using the unstructured.io API to programmatically extract clean text from PDFs, DOCX, HTML, and image-heavy documents.
4
Multimodal Routing
Routing tables, charts, and images to specialized extraction handlers instead of treating all content as plain text.
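One common shape for this routing is a dispatch table keyed on element type, with plain text as the fallback. The element schema and handler names below are invented for the sketch:

```python
def handle_table(el):  return ("table", el["rows"])
def handle_image(el):  return ("ocr", el["path"])
def handle_text(el):   return ("text", el["content"].strip())

HANDLERS = {"table": handle_table, "image": handle_image}

def route(element):
    # Fall back to plain-text handling for unknown element types.
    handler = HANDLERS.get(element["type"], handle_text)
    return handler(element)

elements = [
    {"type": "text", "content": " Plain paragraph. "},
    {"type": "table", "rows": [["a", 1], ["b", 2]]},
    {"type": "image", "path": "fig1.png"},
]
print([route(e) for e in elements])
```

The dispatch-table form keeps handlers independent, so adding a chart or formula handler is one new entry rather than a growing if/elif chain.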
Weekly Win
Pipeline Quality Benchmarking
Benchmark extraction pipeline quality using CCT (Character, Chunk, Token) metrics across multiple document types.
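One plausible, simplified reading of the three CCT dimensions, comparing extracted chunks against the source text; the exact formulas used in the course may differ:

```python
def cct_metrics(source: str, chunks: list):
    """Character coverage, chunk count, and a whitespace token count."""
    extracted = " ".join(chunks)
    return {
        "char_coverage": round(len(extracted) / max(len(source), 1), 3),
        "chunk_count": len(chunks),
        "token_count": sum(len(c.split()) for c in chunks),
    }

source = "The quick brown fox jumps over the lazy dog"
chunks = ["The quick brown fox", "jumps over the lazy dog"]
print(cct_metrics(source, chunks))
```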
W3
Week 3: Multimodal Ingestion & Semantic Chunking
Chunk documents intelligently so retrievers find the right context every time.
1
Context Window Constraints & Token Budgets
Understanding how context window limits dictate chunking strategy and how to calculate token budgets per retrieval.
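The budget arithmetic itself is simple; the numbers below are illustrative assumptions, not model specs:

```python
def chunks_per_query(context_window, system_tokens, question_tokens,
                     answer_reserve, chunk_tokens):
    """How many retrieved chunks fit in one prompt after fixed overheads."""
    available = context_window - system_tokens - question_tokens - answer_reserve
    return max(available // chunk_tokens, 0)

k = chunks_per_query(context_window=8192, system_tokens=400,
                     question_tokens=150, answer_reserve=1024,
                     chunk_tokens=512)
print(k)  # 12
```

Working backwards from a target chunk count is one way to pick a chunk size in the first place.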
2
Fixed & Recursive Chunking
Implementing fixed-size and recursive character-based chunking and measuring their impact on retrieval precision.
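Both strategies in a simplified stdlib sketch; production splitters also merge small pieces back up toward the size limit, which is omitted here:

```python
def fixed_chunks(text: str, size: int = 40):
    """Fixed-size character chunking (may split mid-word)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_chunks(text: str, size: int = 40,
                     seps=("\n\n", "\n", ". ", " ")):
    """Recursively split on the coarsest separator until pieces fit."""
    if len(text) <= size:
        return [text]
    for sep in seps:
        if sep in text:
            out = []
            for part in text.split(sep):
                out.extend(recursive_chunks(part, size, seps))
            return [p for p in out if p.strip()]
    return fixed_chunks(text, size)  # no separator left: hard split

doc = "First paragraph about ETL.\n\nSecond paragraph. It has two sentences."
print(recursive_chunks(doc))
```

The recursive variant tends to keep paragraphs and sentences intact, which is exactly the property that shows up in retrieval-precision comparisons.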
3
Semantic Chunking Theory
Splitting documents at semantic boundaries by measuring embedding distance between adjacent sentences.
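The core idea, sketched with tiny invented "embeddings" in place of a real sentence-embedding model: start a new chunk wherever adjacent-sentence similarity drops below a threshold.

```python
import math

# Toy 3-d vectors per sentence, invented for illustration.
sentences = ["Cats purr.", "Dogs bark.", "SQL has joins.", "CTEs help."]
embs = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.1, 0.9, 0.1), (0.0, 0.8, 0.2)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embs, threshold=0.7):
    """Break whenever similarity between adjacent sentences dips."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embs[i - 1], embs[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

print(semantic_chunks(sentences, embs))
```

The threshold is the main knob: lower values produce fewer, larger chunks; in practice it is often set relative to the distribution of adjacent similarities rather than fixed.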
4
Advanced Boundaries via Dynamic Programming
Using dynamic programming to find globally optimal chunk boundaries that minimize semantic fragmentation.
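One way to formalize this (an assumption about the course's exact objective): treat each cut as severing the similarity between its neighboring sentences, cap chunk length, and minimize total severed similarity. A greedy splitter can't do this globally; a dynamic program over cut points can:

```python
def optimal_boundaries(lens, sims, max_len):
    """lens[i]: length of sentence i; sims[k]: similarity between
    sentences k and k+1. Returns (cut positions, minimal severed similarity)."""
    n = len(lens)
    INF = float("inf")
    dp = [INF] * (n + 1)     # dp[j]: best cost chunking the first j sentences
    back = [0] * (n + 1)
    dp[0] = 0.0
    for j in range(1, n + 1):
        total = 0
        for i in range(j - 1, -1, -1):   # candidate chunk covers i..j-1
            total += lens[i]
            if total > max_len:
                break
            cut = sims[i - 1] if i > 0 else 0.0
            if dp[i] + cut < dp[j]:
                dp[j], back[j] = dp[i] + cut, i
    cuts, j = [], n                      # reconstruct cut positions
    while j > 0:
        if back[j] > 0:
            cuts.append(back[j])
        j = back[j]
    return sorted(cuts), dp[n]

lens = [10, 12, 11, 9]       # sentence lengths (invented)
sims = [0.9, 0.2, 0.8]       # adjacent-sentence similarities (invented)
print(optimal_boundaries(lens, sims, max_len=25))
```

Here the DP cuts at position 2, severing only the weak 0.2 link, rather than breaking either strongly related pair.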
Weekly Win
Agentic Chunking & Contextual Enrichment
Build an agentic chunker that uses an LLM to propose chunk boundaries and enrich each chunk with contextual metadata.
W4
Week 4: Data Clustering & Capstone
Discover hidden structure in document collections without labels.
1
Dimensionality Reduction with UMAP
Projecting high-dimensional embeddings into 2D or 3D with UMAP to visualize document clusters and outliers.
2
Density-Based Clustering with HDBSCAN
Discovering variable-density document clusters without specifying the number of clusters in advance.
3
Topic Modeling for User Intent
Applying topic modeling over clusters to automatically label document groups with human-readable intent descriptions.
4
Parameter Optimization for Cluster Sizes
Tuning UMAP and HDBSCAN hyperparameters using silhouette scores to produce coherent, well-separated clusters.
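The silhouette score used for tuning is easy to compute by hand on a toy example (stdlib only; real pipelines would call a library implementation over the reduced embeddings):

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, where a is
    mean intra-cluster distance and b is mean distance to the nearest other
    cluster. Higher is better; near 1 means tight, well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:
            continue  # singleton clusters contribute no score here
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for k, members in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]
print(round(silhouette(points, labels), 3))
```

A hyperparameter sweep would recompute this score for each UMAP/HDBSCAN setting and keep the configuration that maximizes it.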
Weekly Win
Capstone: Multimodal Knowledge Ingestor
Build a multimodal knowledge ingestor that parses mixed-format documents, chunks semantically, clusters by intent, and indexes into a vector store.