Ever wondered how platforms like Social Blade or VidIQ work under the hood? Today I'm walking you through building your own YouTube analytics backend from scratch. We'll fetch data from YouTube's API, crunch some numbers to extract insights, and store everything in a MySQL database.
Honestly, this was one of those weekend projects that turned into something way more interesting than I expected. Let's dive in!
TL;DR: What We're Building
- YouTube API integration to fetch video data
- Smart data processing for engagement metrics and keyword extraction
- MySQL database with SQLAlchemy ORM
- Flask REST API for easy data access
- Automated pipeline that updates data every 5 minutes
- Real insights like viral categorization and engagement rates
By the end, you'll have a backend that can analyze any YouTube channel and give you meaningful analytics. Pretty cool, right?
Project Setup: Getting Our Hands Dirty
First things first – let's set up our project structure. I'm a big fan of keeping things organized from day one (trust me, future you will thank present you).
Directory Structure
youtube_analytics/
├── app/
│ ├── __init__.py
│ ├── api/
│ │ ├── __init__.py
│ │ └── youtube.py
│ ├── models/
│ │ ├── __init__.py
│ │ └── video.py
│ ├── services/
│ │ ├── __init__.py
│ │ ├── data_processor.py
│ │ └── youtube_service.py
│ ├── tasks/
│ │ ├── __init__.py
│ │ └── update_data.py
│ └── utils/
│ ├── __init__.py
│ └── database.py
├── config/
│ └── config.py
├── requirements.txt
├── README.md
└── main.py
This structure keeps everything modular – services handle business logic, models define our data structure, and APIs expose endpoints. Clean and maintainable!
Installing Dependencies
Create a requirements.txt file with the following dependencies:
Flask==2.2.2
SQLAlchemy==1.4.31
mysqlclient==2.1.0
google-api-python-client==2.37.0
redis==4.1.4
nltk==3.6.7
alembic==1.7.6
gunicorn==20.1.0
pymysql==1.0.2
apscheduler==3.9.1
Werkzeug==2.2.2
Install the dependencies using pip:
pip install -r requirements.txt
Note: You'll need a YouTube Data API key from the Google Cloud Console. It's free within the default daily quota!
API Integration: The Data Source
We'll create a service to interact with the YouTube Data API to fetch video details. This is where the magic begins!
app/services/youtube_service.py
import googleapiclient.discovery
from config.config import YOUTUBE_API_KEY


class YouTubeService:
    def __init__(self):
        self.youtube = googleapiclient.discovery.build(
            "youtube", "v3", developerKey=YOUTUBE_API_KEY
        )

    def get_channel_videos(self, channel_id, max_results=50):
        request = self.youtube.search().list(
            part="id,snippet",
            channelId=channel_id,
            type="video",
            order="date",
            maxResults=max_results
        )
        response = request.execute()
        return response.get('items', [])

    def get_video_details(self, video_id):
        request = self.youtube.videos().list(
            part="snippet,statistics",
            id=video_id
        )
        response = request.execute()
        items = response.get('items', [])
        if not items:
            raise ValueError("No video found with the provided video ID.")
        return items[0]
The YouTube API is pretty straightforward once you get the hang of it. Note that search().list only returns video IDs and basic snippets, so we make a separate videos().list call per video to pull the detailed statistics.
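If you want to sanity-check the service before wiring anything else up, a throwaway snippet like this works (reusing the channel ID from the curl example later in the post; any channel ID will do):

from app.services.youtube_service import YouTubeService

service = YouTubeService()
# Fetch the five most recent videos for the channel and print their view counts.
videos = service.get_channel_videos("UCjwVmfw_BO-59EMp3i7Z8yg", max_results=5)
for item in videos:
    video_id = item['id']['videoId']
    details = service.get_video_details(video_id)
    print(video_id, details['statistics'].get('viewCount', 'n/a'))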
Data Processing: Where Insights Come Alive
Next, we'll create a service to process the fetched data. This is where we turn raw YouTube stats into meaningful insights!
app/services/data_processor.py
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from datetime import datetime

nltk.download('punkt')
nltk.download('stopwords')


class DataProcessor:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))

    def calculate_view_count_trend(self, video_data, days=7):
        now = datetime.utcnow()
        publish_date = datetime.strptime(video_data['snippet']['publishedAt'], '%Y-%m-%dT%H:%M:%SZ')
        days_since_publish = (now - publish_date).days
        if days_since_publish < days:
            days = days_since_publish
        views_per_day = int(video_data['statistics']['viewCount']) / days if days > 0 else 0
        return views_per_day

    def extract_top_keywords(self, text, top_n=10):
        words = word_tokenize(text.lower())
        words = [word for word in words if word.isalnum() and word not in self.stop_words]
        return Counter(words).most_common(top_n)

    def analyze_engagement(self, video_data):
        likes = int(video_data['statistics'].get('likeCount', 0))
        comments = int(video_data['statistics'].get('commentCount', 0))
        views = int(video_data['statistics'].get('viewCount', 0))
        engagement_rate = (likes + comments) / views if views > 0 else 0
        return {
            'likes': likes,
            'comments': comments,
            'engagement_rate': engagement_rate
        }

    def categorize_video(self, video_data):
        view_count = int(video_data['statistics'].get('viewCount', 0))
        if view_count > 1000000:
            return 'Viral'
        elif view_count > 100000:
            return 'Popular'
        elif view_count > 10000:
            return 'Trending'
        else:
            return 'Normal'
    def extract_additional_insights(self, video_data_list):
        total_likes = 0
        total_comments = 0
        total_views = 0
        highest_engagement_rate = 0
        most_engaging_video = None
        for video_data in video_data_list:
            engagement = self.analyze_engagement(video_data)
            total_likes += engagement['likes']
            total_comments += engagement['comments']
            total_views += int(video_data['statistics'].get('viewCount', 0))
            if engagement['engagement_rate'] > highest_engagement_rate:
                highest_engagement_rate = engagement['engagement_rate']
                most_engaging_video = video_data
        average_engagement_rate = (total_likes + total_comments) / total_views if total_views > 0 else 0
        return {
            'most_engaging_video': most_engaging_video,
            'average_engagement_rate': average_engagement_rate
        }
    def process_video_data(self, video_data):
        trends = self.calculate_view_count_trend(video_data)
        keywords = self.extract_top_keywords(video_data['snippet']['title'] + ' ' + video_data['snippet']['description'])
        engagement = self.analyze_engagement(video_data)
        category = self.categorize_video(video_data)
        return {
            'trends': trends,
            'keywords': keywords,
            'engagement': engagement,
            'category': category
        }
Heads-up: The engagement rate calculation here is simplified. In production, you might want to weight likes vs comments differently or include other metrics like shares.
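If you want to experiment with that, here's one possible weighted variant you could drop into DataProcessor (the 2x weight on comments is an arbitrary assumption, not a standard metric):

    def analyze_engagement_weighted(self, video_data, comment_weight=2.0):
        # Same inputs as analyze_engagement, but comments count more than likes.
        stats = video_data['statistics']
        likes = int(stats.get('likeCount', 0))
        comments = int(stats.get('commentCount', 0))
        views = int(stats.get('viewCount', 0))
        weighted_rate = (likes + comment_weight * comments) / views if views > 0 else 0
        return {'likes': likes, 'comments': comments, 'engagement_rate': weighted_rate}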
Database Setup: Structuring Our Data
We'll use SQLAlchemy to define our database models and create the necessary tables. This is where all our processed insights will live.
app/models/video.py
from sqlalchemy import Column, BigInteger, Integer, String, DateTime, Float, Text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Video(Base):
    __tablename__ = 'videos'

    id = Column(String(32), primary_key=True)
    title = Column(String(255))
    description = Column(Text)
    published_at = Column(DateTime)
    view_count = Column(BigInteger)
    like_count = Column(Integer)
    comment_count = Column(Integer)
    engagement_rate = Column(Float)
    category = Column(String(32))
    top_keywords = Column(Text)

    def to_dict(self):
        return {
            'id': self.id,
            'title': self.title,
            'description': self.description,
            'published_at': self.published_at,
            'view_count': self.view_count,
            'like_count': self.like_count,
            'comment_count': self.comment_count,
            'engagement_rate': self.engagement_rate,
            'category': self.category,
            'top_keywords': self.top_keywords
        }
app/utils/database.py
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from config.config import DATABASE_URL

engine = create_engine(DATABASE_URL)
Session = sessionmaker(bind=engine)


def get_session():
    return Session()
config/config.py
import os

# YouTube API Configuration
YOUTUBE_API_KEY = os.environ.get('YOUTUBE_API_KEY', 'your-api-key-here')

# Database Configuration
DATABASE_URL = os.environ.get('DATABASE_URL', 'mysql+pymysql://user:password@localhost/youtube_analytics')

# Flask Configuration
DEBUG = os.environ.get('DEBUG', 'True').lower() == 'true'
SECRET_KEY = os.environ.get('SECRET_KEY', 'your-secret-key-here')
Creating Database Tables
Create a create_db.py script to initialize the database:
from sqlalchemy import create_engine
from app.models.video import Base
from config.config import DATABASE_URL


def create_tables():
    engine = create_engine(DATABASE_URL)
    Base.metadata.create_all(engine)
    print("Tables created successfully.")


if __name__ == "__main__":
    create_tables()
Run the script to create the tables:
python create_db.py
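As a quick sanity check (a throwaway snippet, not one of the project files), you can open a session and confirm the fresh videos table is queryable:

from app.utils.database import get_session
from app.models.video import Video

session = get_session()
try:
    count = session.query(Video).count()
    print(f"videos table ready, {count} rows")  # expect 0 on a fresh database
finally:
    session.close()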
Data Pipeline: Keeping Things Fresh
We'll create a scheduled job to update video data and recalculate insights every 5 minutes using APScheduler. This automation is what makes the platform truly useful!
app/tasks/update_data.py
from app.services.youtube_service import YouTubeService
from app.services.data_processor import DataProcessor
from app.models.video import Video
from app.utils.database import get_session
import json
from datetime import datetime
from apscheduler.schedulers.background import BackgroundScheduler


def save_video_data(video_details):
    session = get_session()
    data_processor = DataProcessor()
    try:
        video_id = video_details['id']
        db_video = session.query(Video).get(video_id)
        if not db_video:
            db_video = Video(id=video_id)
        processed_data = data_processor.process_video_data(video_details)
        db_video.title = video_details['snippet']['title']
        db_video.description = video_details['snippet']['description']
        db_video.published_at = datetime.strptime(video_details['snippet']['publishedAt'], '%Y-%m-%dT%H:%M:%SZ')
        db_video.view_count = int(video_details['statistics'].get('viewCount', 0))
        db_video.like_count = processed_data['engagement']['likes']
        db_video.comment_count = processed_data['engagement']['comments']
        db_video.engagement_rate = processed_data['engagement']['engagement_rate']
        db_video.top_keywords = json.dumps(processed_data['keywords'])
        db_video.category = processed_data['category']
        session.add(db_video)
        session.commit()
    except Exception as e:
        session.rollback()
        raise e
    finally:
        session.close()


def update_video_data(channel_ids):
    print("Processing started")
    youtube_service = YouTubeService()
    try:
        for channel_id in channel_ids:
            videos = youtube_service.get_channel_videos(channel_id)
            print(f"Fetched {len(videos)} videos for channel {channel_id}")
            for video in videos:
                video_id = video['id']['videoId']
                video_details = youtube_service.get_video_details(video_id)
                save_video_data(video_details)
        return f"Successfully updated data for {len(channel_ids)} channels"
    except Exception as e:
        return f"Error updating data: {str(e)}"


def schedule_updates(channel_ids):
    scheduler = BackgroundScheduler()
    scheduler.add_job(lambda: update_video_data(channel_ids), 'interval', minutes=5)
    scheduler.start()
API Endpoints: Making Data Accessible
Let's create some basic API endpoints using Flask to expose all our processed data:
app/api/youtube.py
from flask import Blueprint, jsonify, request
from app.models.video import Video
from app.utils.database import get_session
from app.tasks.update_data import update_video_data, save_video_data
from app.services.youtube_service import YouTubeService
from app.services.data_processor import DataProcessor

youtube_bp = Blueprint('youtube', __name__)


@youtube_bp.route('/videos', methods=['GET'])
def get_videos():
    session = get_session()
    videos = session.query(Video).all()
    return jsonify([video.to_dict() for video in videos])


@youtube_bp.route('/videos/<video_id>', methods=['GET'])
def get_video(video_id):
    session = get_session()
    video = session.query(Video).get(video_id)
    if video:
        return jsonify(video.to_dict())
    return jsonify({'error': 'Video not found'}), 404


@youtube_bp.route('/update_videos', methods=['POST'])
def update_videos():
    channel_ids = request.json.get('channel_ids')
    if not channel_ids:
        return jsonify({'error': 'No channel IDs provided'}), 400
    update_video_data(channel_ids)
    return jsonify({'message': 'Video data update initiated'})


@youtube_bp.route('/youtube_video_details/<video_id>', methods=['GET'])
def get_youtube_video_details(video_id):
    youtube_service = YouTubeService()
    try:
        video_details = youtube_service.get_video_details(video_id)
        save_video_data(video_details)
        return jsonify(video_details)
    except Exception as e:
        return jsonify({'error': str(e)}), 500


@youtube_bp.route('/video_insights/<video_id>', methods=['GET'])
def get_video_insights(video_id):
    youtube_service = YouTubeService()
    data_processor = DataProcessor()
    try:
        video_details = youtube_service.get_video_details(video_id)
        insights = data_processor.process_video_data(video_details)
        return jsonify(insights)
    except Exception as e:
        return jsonify({'error': str(e)}), 500
Main Application File
Create the main Flask application:
# main.py
from flask import Flask
from app.api.youtube import youtube_bp

app = Flask(__name__)
app.register_blueprint(youtube_bp, url_prefix='/api')

if __name__ == '__main__':
    app.run(debug=True)
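One thing worth pointing out: nothing calls schedule_updates yet. A minimal way to wire the 5-minute background job into main.py at startup might look like this sketch (CHANNEL_IDS is a hypothetical list you'd fill in with the channels you want to track):

# main.py (variant with the background scheduler enabled)
from flask import Flask
from app.api.youtube import youtube_bp
from app.tasks.update_data import schedule_updates

# Hypothetical list of channels to track; replace with your own IDs.
CHANNEL_IDS = ["UCjwVmfw_BO-59EMp3i7Z8yg"]

app = Flask(__name__)
app.register_blueprint(youtube_bp, url_prefix='/api')

if __name__ == '__main__':
    schedule_updates(CHANNEL_IDS)  # starts the 5-minute background job
    app.run(debug=True, use_reloader=False)  # the debug reloader would start the scheduler twice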
Running the Application: Let's See It Work!
Start the application:
python main.py
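For anything beyond local testing, gunicorn is already in our requirements.txt; a typical invocation (assuming the Flask app object is named app in main.py, as above) looks like:

gunicorn -w 2 -b 0.0.0.0:5000 main:app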
Example Requests: Testing Our Backend
Here are some sample API calls you can try to test everything:
Get All Videos:
curl http://localhost:5000/api/videos
Get Video Details:
curl http://localhost:5000/api/videos/VIDEO_ID
Update Videos:
curl -X POST -H "Content-Type: application/json" -d '{"channel_ids":["UCjwVmfw_BO-59EMp3i7Z8yg"]}' http://localhost:5000/api/update_videos
Get YouTube Video Details and Save to Database:
curl http://localhost:5000/api/youtube_video_details/VIDEO_ID
Get Video Insights:
curl http://localhost:5000/api/video_insights/VIDEO_ID
Replace VIDEO_ID with an actual YouTube video ID – try using "dQw4w9WgXcQ" for Rick Astley's classic hit! 😄
Sample API Responses: What You'll Get Back
Here's what the API responses look like in action:
Get Video Details Response
{
  "id": "dQw4w9WgXcQ",
  "title": "Rick Astley - Never Gonna Give You Up (Official Video)",
  "description": "The official video for \"Never Gonna Give You Up\" by Rick Astley...",
  "published_at": "2009-10-25T06:57:33Z",
  "view_count": 1400000000,
  "like_count": 14000000,
  "comment_count": 2800000,
  "engagement_rate": 0.012,
  "category": "Viral",
  "top_keywords": "[[\"rick\", 15], [\"astley\", 12], [\"music\", 8], [\"video\", 6]]"
}
Get Video Insights Response
{
  "trends": 200000000.0,
  "keywords": [
    ["rick", 15],
    ["astley", 12],
    ["music", 8],
    ["video", 6],
    ["official", 4]
  ],
  "engagement": {
    "likes": 14000000,
    "comments": 2800000,
    "engagement_rate": 0.012
  },
  "category": "Viral"
}
Get All Videos Response
[
  {
    "id": "dQw4w9WgXcQ",
    "title": "Rick Astley - Never Gonna Give You Up (Official Video)",
    "view_count": 1400000000,
    "engagement_rate": 0.012,
    "category": "Viral"
  },
  {
    "id": "another_video_id",
    "title": "Another Video Title",
    "view_count": 500000,
    "engagement_rate": 0.008,
    "category": "Popular"
  }
]
Pretty neat to see those engagement rates and categorizations in action!
What's Next: Scaling & Improvements
This backend is a solid foundation, but there's always room for improvement. Here are some ideas for taking it to the next level:
1. Caching Layer
Add Redis for faster data access on frequently requested videos.
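Since redis is already in requirements.txt, a first pass could be as simple as caching per-video lookups; here's a rough sketch (the "video:<id>" key format and the 300-second TTL are arbitrary choices):

import json
import redis
from app.models.video import Video
from app.utils.database import get_session

cache = redis.Redis(host='localhost', port=6379, db=0)

def get_video_cached(video_id, ttl=300):
    # Try the cache first, fall back to the database on a miss.
    cached = cache.get(f"video:{video_id}")
    if cached:
        return json.loads(cached)
    session = get_session()
    try:
        video = session.query(Video).get(video_id)
        if not video:
            return None
        data = video.to_dict()
        cache.setex(f"video:{video_id}", ttl, json.dumps(data, default=str))
        return data
    finally:
        session.close()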
2. Advanced Analytics
- Sentiment analysis on comments (see the sketch after this list)
- Trend prediction algorithms
- Channel comparison metrics
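For the sentiment idea, NLTK's built-in VADER analyzer is a low-effort starting point; a rough sketch might look like this (you'd still need to fetch comment text via the API's commentThreads endpoint, which this post doesn't cover):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

def score_comments(comments):
    # Average compound sentiment score across a list of comment strings (-1 to 1).
    sia = SentimentIntensityAnalyzer()
    scores = [sia.polarity_scores(text)['compound'] for text in comments]
    return sum(scores) / len(scores) if scores else 0.0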
3. Real-time Features
- WebSocket connections for live updates
- Push notifications for viral videos
4. Better Error Handling
- Retry mechanisms for API failures (a minimal sketch follows this list)
- Graceful degradation when YouTube API is down
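For retries, even a plain-stdlib backoff loop goes a long way; here's a rough sketch that wraps any API call (the 3-attempt, doubling-delay policy is just an example value):

import time

def with_retries(func, *args, attempts=3, delay=1.0, **kwargs):
    # Retries func on any exception, doubling the delay between attempts.
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}), retrying in {delay}s")
            time.sleep(delay)
            delay *= 2

# Usage: with_retries(youtube_service.get_video_details, "dQw4w9WgXcQ")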
Key Takeaways
- Start simple, scale smart – We built a working system first, then identified improvement opportunities
- Separate concerns – Services handle business logic, models define data, APIs expose functionality
- Process data meaningfully – Raw metrics are just numbers; insights come from proper analysis
- Automate everything – The scheduled pipeline keeps data fresh without manual intervention
- Plan for growth – Modular structure makes it easy to add features and scale components
Wrapping Up
Building this YouTube analytics backend was a fantastic learning experience! We've created something that can genuinely provide value – whether you're a content creator wanting to analyze your performance or a business looking to understand YouTube trends.
The combination of Python's excellent libraries with YouTube's robust API makes for a powerful analytics platform. The best part? Everything we've built is easily extensible.
Give it a try and let me know what insights you discover! Drop a comment below if you build something cool with this backend or if you have questions about any part of the implementation. 👍
Source Code: Complete implementation available with additional features for those interested in extending the platform further.
Happy coding! 🚀