Build the data pipelines that feed production AI systems. Over four weeks, master SQL-based ETL, unstructured PDF parsing, multimodal semantic chunking, and unsupervised data clustering, ending with a multimodal knowledge ingestor.
4 weeks
What You'll Learn
Build orchestrated ETL pipelines using Dagster with incremental loading
Parse and extract structured data from PDFs using PyMuPDF and unstructured.io
Implement fixed, recursive, semantic, and agentic chunking strategies
Reduce high-dimensional data with UMAP and cluster it with HDBSCAN
Benchmark pipeline quality using CCT (Character, Chunk, Token) metrics
Course Content
W1
Week 1: Traditional ETL & Structured Data
Build reliable, orchestrated pipelines for structured data sources.
1
SQL Fundamentals & Aggregations
Writing production SQL (joins, window functions, CTEs) to extract and aggregate data from relational sources.
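A runnable illustration of two of these constructs, a CTE and a window function, using Python's built-in sqlite3. The table and data are invented for the example:

```python
import sqlite3

# Illustrative schema; table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'acme', 100.0), (2, 'acme', 50.0), (3, 'globex', 75.0);
""")

# CTE + window function: each order with its customer's running total.
rows = conn.execute("""
    WITH ranked AS (
        SELECT id, customer, amount,
               SUM(amount) OVER (
                   PARTITION BY customer ORDER BY id
               ) AS running_total
        FROM orders
    )
    SELECT customer, amount, running_total FROM ranked
""").fetchall()

for customer, amount, total in rows:
    print(customer, amount, total)
```

Note that window functions require SQLite 3.25+, which ships with modern Python builds.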
2
ETL Architecture & Data Contracts
Designing ETL pipelines with explicit data contracts that catch schema drift before it breaks downstream models.
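A minimal sketch of a data contract check in plain Python; the field names and types are invented for illustration:

```python
# Hypothetical data contract: expected column names mapped to Python types.
CONTRACT = {"user_id": int, "email": str, "signup_ts": str}

def validate(record: dict, contract: dict = CONTRACT) -> list:
    """Return a list of contract violations (empty list = clean record)."""
    errors = []
    for field, expected in contract.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(record[field]).__name__}")
    # Unknown fields often signal upstream schema drift.
    for field in record.keys() - contract.keys():
        errors.append(f"unexpected field: {field}")
    return errors

print(validate({"user_id": 1, "email": "a@b.co", "signup_ts": "2024-01-01"}))
print(validate({"user_id": "1", "email": "a@b.co", "plan": "pro"}))
```

Running a check like this at pipeline boundaries surfaces drift as an explicit error rather than a silent downstream failure.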
3
Pipeline Orchestration with Dagster
Defining assets and jobs in Dagster to schedule, monitor, and retry ETL pipelines with full observability.
4
Incremental Loading & API Limits
Implementing cursor-based incremental loads to sync only new records while respecting API rate limits.
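The cursor pattern can be sketched against a fake paginated API; all names and the rate-limit strategy here are simplified assumptions:

```python
import time

# Fake paginated source: returns records with ids > cursor, `limit` per call.
DATA = [{"id": i, "value": f"row-{i}"} for i in range(1, 26)]

def fetch_page(cursor: int, limit: int = 10):
    return [r for r in DATA if r["id"] > cursor][:limit]

def incremental_sync(cursor: int, min_interval: float = 0.0):
    """Pull only records newer than `cursor`, with simple call spacing."""
    synced = []
    while True:
        page = fetch_page(cursor)
        if not page:
            break
        synced.extend(page)
        cursor = page[-1]["id"]        # advance cursor to last record seen
        time.sleep(min_interval)       # crude rate-limit spacing between calls
    return synced, cursor

rows, cursor = incremental_sync(cursor=20)
print(len(rows), cursor)  # 5 new records, cursor now 25
```

Persisting the returned cursor between runs is what makes the next sync pick up exactly where this one stopped.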
Weekly Win
External Data Integration
Build a Dagster pipeline that incrementally loads data from an external API into a structured store with schema validation at every step.
W2
Week 2: Parsing Unstructured Text & PDFs
Extract clean, structured data from messy real-world documents.
1
Document Layout Analysis
Detecting bounding boxes and understanding page layout to correctly separate headers, body text, tables, and figures.
2
Slicing PDFs with PyMuPDF
Dynamically slicing PDF documents based on table-of-contents metadata to extract section-level content programmatically.
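PyMuPDF's `doc.get_toc()` returns entries shaped like `[level, title, start_page]` with 1-based pages. A stdlib sketch of turning top-level entries into page ranges (the sample TOC is invented):

```python
def toc_to_ranges(toc, page_count):
    """Map each top-level TOC entry to an inclusive (start, end) page range."""
    tops = [(title, page) for level, title, page in toc if level == 1]
    ranges = {}
    for i, (title, start) in enumerate(tops):
        # A section ends where the next top-level section begins.
        end = tops[i + 1][1] - 1 if i + 1 < len(tops) else page_count
        ranges[title] = (start, end)
    return ranges

toc = [[1, "Intro", 1], [2, "Background", 2], [1, "Methods", 4], [1, "Results", 9]]
print(toc_to_ranges(toc, page_count=12))
```

With the ranges in hand, a real pipeline would copy each page span into its own document or extract its text section by section.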
3
Extraction with unstructured.io
Using the unstructured.io API to programmatically extract clean text from PDFs, DOCX, HTML, and image-heavy documents.
4
Multimodal Routing
Routing tables, charts, and images to specialized extraction handlers instead of treating all content as plain text.
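One common shape for this routing is a dispatch table keyed on element type, with plain text as the fallback. The element schema and handler names below are invented for the sketch:

```python
def handle_table(el):  return ("table", el["rows"])
def handle_image(el):  return ("ocr", el["path"])
def handle_text(el):   return ("text", el["content"].strip())

HANDLERS = {"table": handle_table, "image": handle_image}

def route(element):
    # Fall back to plain-text handling for unknown element types.
    handler = HANDLERS.get(element["type"], handle_text)
    return handler(element)

elements = [
    {"type": "text", "content": " Plain paragraph. "},
    {"type": "table", "rows": [["a", 1], ["b", 2]]},
    {"type": "image", "path": "fig1.png"},
]
print([route(e) for e in elements])
```

The dispatch-table form keeps handlers independent, so adding a chart or formula handler is one new entry rather than a growing if/elif chain.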
Weekly Win
Pipeline Quality Benchmarking
Benchmark extraction pipeline quality using CCT (Character, Chunk, Token) metrics across multiple document types.
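One plausible, simplified reading of the three CCT dimensions, comparing extracted chunks against the source text; the exact formulas used in the course may differ:

```python
def cct_metrics(source: str, chunks: list):
    """Character coverage, chunk count, and a whitespace token count."""
    extracted = " ".join(chunks)
    return {
        "char_coverage": round(len(extracted) / max(len(source), 1), 3),
        "chunk_count": len(chunks),
        "token_count": sum(len(c.split()) for c in chunks),
    }

source = "The quick brown fox jumps over the lazy dog"
chunks = ["The quick brown fox", "jumps over the lazy dog"]
print(cct_metrics(source, chunks))
```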
W3
Week 3: Multimodal Ingestion & Semantic Chunking
Chunk documents intelligently so retrievers find the right context every time.
1
Context Window Constraints & Token Budgets
Understanding how context window limits dictate chunking strategy and how to calculate token budgets per retrieval.
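The budget arithmetic itself is simple; the numbers below are illustrative assumptions, not model specs:

```python
def chunks_per_query(context_window, system_tokens, question_tokens,
                     answer_reserve, chunk_tokens):
    """How many retrieved chunks fit in one prompt after fixed overheads."""
    available = context_window - system_tokens - question_tokens - answer_reserve
    return max(available // chunk_tokens, 0)

k = chunks_per_query(context_window=8192, system_tokens=400,
                     question_tokens=150, answer_reserve=1024,
                     chunk_tokens=512)
print(k)  # 12
```

Working backwards from a target chunk count is one way to pick a chunk size in the first place.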
2
Fixed & Recursive Chunking
Implementing fixed-size and recursive character-based chunking and measuring their impact on retrieval precision.
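Both strategies in a simplified stdlib sketch; production splitters also merge small pieces back up toward the size limit, which is omitted here:

```python
def fixed_chunks(text: str, size: int = 40):
    """Fixed-size character chunking (may split mid-word)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_chunks(text: str, size: int = 40,
                     seps=("\n\n", "\n", ". ", " ")):
    """Recursively split on the coarsest separator until pieces fit."""
    if len(text) <= size:
        return [text]
    for sep in seps:
        if sep in text:
            out = []
            for part in text.split(sep):
                out.extend(recursive_chunks(part, size, seps))
            return [p for p in out if p.strip()]
    return fixed_chunks(text, size)  # no separator left: hard split

doc = "First paragraph about ETL.\n\nSecond paragraph. It has two sentences."
print(recursive_chunks(doc))
```

The recursive variant tends to keep paragraphs and sentences intact, which is exactly the property that shows up in retrieval-precision comparisons.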
3
Semantic Chunking Theory
Splitting documents at semantic boundaries by measuring embedding distance between adjacent sentences.
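The core idea, sketched with tiny invented "embeddings" in place of a real sentence-embedding model: start a new chunk wherever adjacent-sentence similarity drops below a threshold.

```python
import math

# Toy 3-d vectors per sentence, invented for illustration.
sentences = ["Cats purr.", "Dogs bark.", "SQL has joins.", "CTEs help."]
embs = [(0.9, 0.1, 0.0), (0.8, 0.2, 0.0), (0.1, 0.9, 0.1), (0.0, 0.8, 0.2)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embs, threshold=0.7):
    """Break whenever similarity between adjacent sentences dips."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embs[i - 1], embs[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

print(semantic_chunks(sentences, embs))
```

The threshold is the main knob: lower values produce fewer, larger chunks; in practice it is often set relative to the distribution of adjacent similarities rather than fixed.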
4
Advanced Boundaries via Dynamic Programming
Using dynamic programming to find globally optimal chunk boundaries that minimize semantic fragmentation.
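One way to formalize this (an assumption about the course's exact objective): treat each cut as severing the similarity between its neighboring sentences, cap chunk length, and minimize total severed similarity. A greedy splitter can't do this globally; a dynamic program over cut points can:

```python
def optimal_boundaries(lens, sims, max_len):
    """lens[i]: length of sentence i; sims[k]: similarity between
    sentences k and k+1. Returns (cut positions, minimal severed similarity)."""
    n = len(lens)
    INF = float("inf")
    dp = [INF] * (n + 1)     # dp[j]: best cost chunking the first j sentences
    back = [0] * (n + 1)
    dp[0] = 0.0
    for j in range(1, n + 1):
        total = 0
        for i in range(j - 1, -1, -1):   # candidate chunk covers i..j-1
            total += lens[i]
            if total > max_len:
                break
            cut = sims[i - 1] if i > 0 else 0.0
            if dp[i] + cut < dp[j]:
                dp[j], back[j] = dp[i] + cut, i
    cuts, j = [], n                      # reconstruct cut positions
    while j > 0:
        if back[j] > 0:
            cuts.append(back[j])
        j = back[j]
    return sorted(cuts), dp[n]

lens = [10, 12, 11, 9]       # sentence lengths (invented)
sims = [0.9, 0.2, 0.8]       # adjacent-sentence similarities (invented)
print(optimal_boundaries(lens, sims, max_len=25))
```

Here the DP cuts at position 2, severing only the weak 0.2 link, rather than breaking either strongly related pair.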
Weekly Win
Agentic Chunking & Contextual Enrichment
Build an agentic chunker that uses an LLM to propose chunk boundaries and enrich each chunk with contextual metadata.
W4
Week 4: Data Clustering & Capstone
Discover hidden structure in document collections without labels.
1
Dimensionality Reduction with UMAP
Projecting high-dimensional embeddings into 2D or 3D with UMAP to visualize document clusters and outliers.
2
Density-Based Clustering with HDBSCAN
Discovering variable-density document clusters without specifying the number of clusters in advance.
3
Topic Modeling for User Intent
Applying topic modeling over clusters to automatically label document groups with human-readable intent descriptions.
4
Parameter Optimization for Cluster Sizes
Tuning UMAP and HDBSCAN hyperparameters using silhouette scores to produce coherent, well-separated clusters.
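The silhouette score used for tuning is easy to compute by hand on a toy example (stdlib only; real pipelines would call a library implementation over the reduced embeddings):

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) per point, where a is
    mean intra-cluster distance and b is mean distance to the nearest other
    cluster. Higher is better; near 1 means tight, well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:
            continue  # singleton clusters contribute no score here
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(sum(math.dist(p, q) for q in members) / len(members)
                for k, members in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
labels = [0, 0, 1, 1]
print(round(silhouette(points, labels), 3))
```

A hyperparameter sweep would recompute this score for each UMAP/HDBSCAN setting and keep the configuration that maximizes it.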
Weekly Win
Capstone: Multimodal Knowledge Ingestor
Build a multimodal knowledge ingestor that parses mixed-format documents, chunks semantically, clusters by intent, and indexes into a vector store.