How to Convert Your Blog Posts to Markdown for AI Analysis
What You’ll Learn
By the end of this tutorial, you’ll be able to:
- Extract all blog posts from your sitemap automatically
- Convert HTML content to clean markdown
- Preserve metadata (author, dates, categories, descriptions)
- Structure posts with YAML frontmatter for easy LLM analysis
- Combine posts intelligently for batch analysis
Time investment: 30 minutes setup, 2 minutes runtime for 100+ posts
Technical level: Basic command line knowledge required
Prerequisites
- Python 3.7+ installed on your computer
- Access to your blog’s sitemap (usually at yoursite.com/sitemap.xml)
- Basic familiarity with terminal/command prompt
Platform agnostic: This works for any blog platform (WordPress, Webflow, Ghost, custom CMS) as long as you have a sitemap.
Step 1: Understand Your Blog Structure
Before writing any code, inspect your blog to understand:
A. Sitemap Location
Most blogs have a sitemap at:
https://yourblog.com/sitemap.xml
https://yourblog.com/sitemap_index.xml
https://yourblog.com/post-sitemap.xml
Check your robots.txt file (yourblog.com/robots.txt) to find your sitemap URL.
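As a rough sketch of that robots.txt check, the snippet below pulls every `Sitemap:` line out of a robots.txt body that you have already downloaded. The function name `sitemaps_from_robots` and the sample text are illustrative assumptions, not part of the converter script.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first colon only
        line = line.split('#', 1)[0].strip()
        field, _, value = line.partition(':')
        if field.strip().lower() == 'sitemap' and value.strip():
            urls.append(value.strip())
    return urls


# Placeholder robots.txt content for illustration
sample = """User-agent: *
Disallow: /admin/
Sitemap: https://yourblog.com/sitemap.xml
"""
print(sitemaps_from_robots(sample))
```

If the list comes back empty, fall back to trying the common locations listed above by hand.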
B. HTML Structure
Visit one blog post and inspect the HTML (right-click → “View Page Source”):
Find these elements:
- Main content container (usually has a class like post-content, article-body, rich-text)
- Title element (usually an <h1> tag)
- Author name
- Publication date
- Category/tags
- Featured image
Pro tip: Use browser DevTools (F12) to inspect elements and identify CSS selectors.
For example:
- Content: <div class="text-rich-text">
- Title: <h1 class="heading-style-h2">
- Author: <p class="text-size-regular text-color-brandnile">
- Date: <p class="text-size-regular text-color-grey">
Step 2: Set Up Your Environment
Create Project Folder
mkdir blog_converter
cd blog_converter
Install Python Dependencies
Create a requirements.txt file:
requests>=2.31.0
beautifulsoup4>=4.12.0
lxml>=4.9.0
markdownify>=0.11.6
PyYAML>=6.0.1
Install dependencies:
pip install -r requirements.txt
# Or on Mac: pip3 install -r requirements.txt
What each library does:
- requests – Fetches web pages
- beautifulsoup4 – Parses HTML
- lxml – Fast XML/HTML parser
- markdownify – Converts HTML to Markdown
- PyYAML – Handles YAML frontmatter
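To confirm the install worked before moving on, a small check like the following may help. It is only a sketch: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, and the list assumes the five packages from requirements.txt above.

```python
import importlib.util

# Import names differ from the pip package names for two of the libraries:
# beautifulsoup4 is imported as "bs4" and PyYAML as "yaml".
REQUIRED_MODULES = ["requests", "bs4", "lxml", "markdownify", "yaml"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, re-run the pip command with the same Python interpreter you will use to run the script.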
Step 3: Build the Converter Script
Create blog_to_markdown.py:
#!/usr/bin/env python3
"""
Blog Post to Markdown Converter
Extracts blog posts from sitemap and converts them to markdown with YAML frontmatter
"""
import re
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md
from pathlib import Path
from datetime import datetime
import yaml
import time


class BlogConverter:
    def __init__(self, sitemap_url, output_dir="blog_posts"):
        self.sitemap_url = sitemap_url
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; BlogConverter/1.0)'
        })

    def fetch_sitemap(self):
        """Fetch and parse the sitemap to get all blog post URLs"""
        print(f"Fetching sitemap from {self.sitemap_url}...")
        response = self.session.get(self.sitemap_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'xml')
        urls = soup.find_all('loc')
        # Filter only blog post URLs (adjust pattern for your blog)
        blog_urls = [url.text for url in urls if '/blog-post/' in url.text]
        print(f"Found {len(blog_urls)} blog posts")
        return blog_urls

    def fetch_page(self, url):
        """Fetch a single blog post page"""
        response = self.session.get(url)
        response.raise_for_status()
        return BeautifulSoup(response.content, 'html.parser')

    def extract_metadata(self, soup, url):
        """Extract metadata from the blog post"""
        metadata = {
            'url': url,
            # strip('/') keeps a trailing slash in the URL out of the slug
            'slug': url.split('/blog-post/')[-1].strip('/')
        }
        # Title - adjust selector for your blog
        title_tag = soup.find('h1', class_='heading-style-h2')
        metadata['title'] = title_tag.get_text(strip=True) if title_tag else 'Untitled'
        # Meta description (.get avoids a KeyError if the tag has no content attribute)
        meta_desc = soup.find('meta', attrs={'name': 'description'})
        metadata['description'] = meta_desc.get('content', '') if meta_desc else ''
        # Category - adjust selector for your blog
        category_div = soup.find('div', class_='w-dyn-list')
        if category_div:
            category_item = category_div.find('div', role='listitem')
            if category_item:
                metadata['category'] = category_item.get_text(strip=True)
        # Author - adjust selector for your blog
        author_tag = soup.find('p', class_='text-size-regular text-color-brandnile')
        metadata['author'] = author_tag.get_text(strip=True) if author_tag else ''
        # Dates - adjust selector for your blog
        date_tags = soup.find_all('p', class_='text-size-regular text-color-grey')
        for tag in date_tags:
            text = tag.get_text(strip=True)
            parent = tag.find_parent('div', class_='date_flex')
            if parent:
                label = parent.find('div', class_='text-weight-medium')
                if label:
                    label_text = label.get_text(strip=True)
                    if 'Published' in label_text:
                        metadata['published_date'] = text
                    elif 'Updated' in label_text:
                        metadata['updated_date'] = text
        # Featured image
        img_tag = soup.find('img', class_='blogtp_hero-banner')
        if img_tag:
            metadata['featured_image'] = img_tag.get('src', '')
        # OG Image (fallback)
        if not metadata.get('featured_image'):
            og_image = soup.find('meta', property='og:image')
            if og_image:
                metadata['featured_image'] = og_image.get('content', '')
        return metadata

    def extract_content(self, soup):
        """Extract the main blog post content"""
        # Adjust selector for your blog's content container
        content_div = soup.find('div', class_='text-rich-text')
        if not content_div:
            return ""
        # Remove unwanted elements
        for element in content_div.find_all(['script', 'style']):
            element.decompose()
        # Convert to markdown
        html_content = str(content_div)
        markdown_content = md(html_content, heading_style="ATX", bullets="-")
        # Clean up the markdown
        markdown_content = self.clean_markdown(markdown_content)
        return markdown_content

    def clean_markdown(self, text):
        """Clean up markdown formatting"""
        # Ensure a blank line before and after each header line.
        # The ^ anchor (with MULTILINE) stops a mid-line "# " from being treated as a header.
        text = re.sub(r'\n(#{1,6} )', r'\n\n\1', text)
        text = re.sub(r'^(#{1,6} .+)\n', r'\1\n\n', text, flags=re.MULTILINE)
        # Collapse excessive newlines last, since the header fixes above add some
        text = re.sub(r'\n{3,}', '\n\n', text)
        # Remove leading/trailing whitespace
        text = text.strip()
        return text

    def create_markdown_file(self, metadata, content):
        """Create a markdown file with YAML frontmatter"""
        # Create frontmatter
        frontmatter = {
            'title': metadata.get('title', ''),
            'slug': metadata.get('slug', ''),
            'url': metadata.get('url', ''),
            'description': metadata.get('description', ''),
            'author': metadata.get('author', ''),
            'category': metadata.get('category', ''),
            'published_date': metadata.get('published_date', ''),
            'updated_date': metadata.get('updated_date', ''),
            'featured_image': metadata.get('featured_image', '')
        }
        # Remove empty values
        frontmatter = {k: v for k, v in frontmatter.items() if v}
        # Create the complete markdown document
        markdown_doc = "---\n"
        markdown_doc += yaml.dump(frontmatter, allow_unicode=True, sort_keys=False)
        markdown_doc += "---\n\n"
        markdown_doc += content
        return markdown_doc

    def save_markdown(self, slug, markdown_content):
        """Save the markdown file"""
        # Sanitize filename
        filename = re.sub(r'[^\w\-]', '_', slug)
        filepath = self.output_dir / f"{filename}.md"
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(markdown_content)
        return filepath

    def process_post(self, url):
        """Process a single blog post"""
        try:
            print(f"Processing: {url}")
            # Fetch the page
            soup = self.fetch_page(url)
            # Extract metadata
            metadata = self.extract_metadata(soup, url)
            # Extract content
            content = self.extract_content(soup)
            if not content:
                print(f" ⚠️ Warning: No content found for {url}")
                return None
            # Create markdown document
            markdown_doc = self.create_markdown_file(metadata, content)
            # Save to file
            filepath = self.save_markdown(metadata['slug'], markdown_doc)
            print(f" ✓ Saved to: {filepath}")
            return filepath
        except Exception as e:
            print(f" ✗ Error processing {url}: {str(e)}")
            return None

    def create_index(self, results):
        """Create an index file with all posts"""
        index_path = self.output_dir / "INDEX.md"
        with open(index_path, 'w', encoding='utf-8') as f:
            f.write("# Blog Posts Index\n\n")
            f.write(f"Total posts: {len(results)}\n\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
            f.write("---\n\n")
            for i, (url, filepath) in enumerate(results.items(), 1):
                if filepath:
                    slug = filepath.stem
                    f.write(f"{i}. [{slug}]({filepath.name}) - {url}\n")
        print(f"\n✓ Created index file: {index_path}")

    def run(self, limit=None):
        """Run the conversion process"""
        start_time = time.time()
        # Fetch all blog URLs
        blog_urls = self.fetch_sitemap()
        # Limit for testing
        if limit:
            blog_urls = blog_urls[:limit]
            print(f"Processing first {limit} posts only (testing mode)")
        # Process each post
        results = {}
        for i, url in enumerate(blog_urls, 1):
            print(f"\n[{i}/{len(blog_urls)}]")
            filepath = self.process_post(url)
            results[url] = filepath
            # Be polite - add a small delay between requests
            time.sleep(0.5)
        # Create index
        self.create_index(results)
        # Summary
        successful = sum(1 for fp in results.values() if fp is not None)
        failed = len(results) - successful
        elapsed = time.time() - start_time
        print(f"\n{'='*60}")
        print("Conversion Complete!")
        print(f"{'='*60}")
        print(f"Total posts: {len(results)}")
        print(f"Successful: {successful}")
        print(f"Failed: {failed}")
        print(f"Time elapsed: {elapsed:.2f} seconds")
        print(f"Output directory: {self.output_dir.absolute()}")


def main():
    import argparse
    parser = argparse.ArgumentParser(description='Convert blog posts to Markdown')
    parser.add_argument('--sitemap', required=True,
                        help='Sitemap URL')
    parser.add_argument('--output', default='blog_posts',
                        help='Output directory')
    parser.add_argument('--limit', type=int,
                        help='Limit number of posts (for testing)')
    args = parser.parse_args()
    converter = BlogConverter(args.sitemap, args.output)
    converter.run(limit=args.limit)


if __name__ == '__main__':
    main()
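One case that fetch_sitemap() in the script does not handle is a sitemap index (such as sitemap_index.xml), whose <loc> entries point to other sitemaps rather than to posts, so the '/blog-post/' filter would find nothing. A minimal sketch of telling the two apart with the standard library follows. `parse_sitemap` is a hypothetical helper, not part of the script, and the sample URL is a placeholder.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return ("index", child_sitemap_urls) or ("urlset", page_urls)."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag.endswith("sitemapindex") else "urlset"
    parent = "sm:sitemap" if kind == "index" else "sm:url"
    locs = [loc.text.strip()
            for loc in root.findall(f"{parent}/sm:loc", NS) if loc.text]
    return kind, locs


index_xml = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    '<sitemap><loc>https://yourblog.com/post-sitemap.xml</loc></sitemap>'
    '</sitemapindex>'
)
print(parse_sitemap(index_xml))
```

When the result is an index, pass one of the child sitemap URLs to --sitemap instead of the index itself.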
Step 4: Customize for Your Blog
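Critical: You must adjust the CSS selectors in the extract_metadata() and extract_content() methods to match your blog’s HTML structure. The '/blog-post/' URL pattern used in fetch_sitemap() and extract_metadata() also needs to match your blog’s post URLs, or the script will find no posts.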
How to Find Your Selectors:
- Open any blog post on your site
- Right-click on the title → “Inspect”
- Note the class name (e.g., post-title, entry-title, article-heading)
- Replace in the script:
# Change this:
title_tag = soup.find('h1', class_='heading-style-h2')
# To your selector:
title_tag = soup.find('h1', class_='post-title')
Repeat for:
- Content container
- Author
- Date
- Category
- Featured image
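The '/blog-post/' URL segment is hard-coded in two places in the script, so it can help to keep it in one constant. The following is a sketch under assumptions: `POST_MARKER`, `is_post_url`, and `slug_from_url` are hypothetical names, and '/blog-post/' is simply the value the script uses.

```python
from urllib.parse import urlparse

POST_MARKER = "/blog-post/"  # assumption: change to "/blog/", "/posts/", etc.


def is_post_url(url, marker=POST_MARKER):
    """True when the URL path contains the marker followed by a slug."""
    path = urlparse(url).path
    return marker in path and bool(path.split(marker, 1)[1].strip("/"))


def slug_from_url(url, marker=POST_MARKER):
    """Return the path part after the marker, without a trailing slash."""
    path = urlparse(url).path
    return path.split(marker, 1)[1].strip("/")


print(slug_from_url("https://yourblog.com/blog-post/example-post/"))
```

Working on the parsed path rather than the raw URL means query strings such as tracking parameters do not end up in the slug.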
Common Selectors by Platform:
WordPress:
content_div = soup.find('div', class_='entry-content')
title_tag = soup.find('h1', class_='entry-title')
author_tag = soup.find('span', class_='author')
Ghost:
content_div = soup.find('div', class_='post-content')
title_tag = soup.find('h1', class_='post-title')
author_tag = soup.find('a', class_='author-name')
Medium:
content_div = soup.find('article')
title_tag = soup.find('h1')
author_tag = soup.find('a', attrs={'data-action': 'show-user-card'})
Step 5: Run the Converter
Test with Limited Posts First
python3 blog_to_markdown.py --sitemap https://yourblog.com/sitemap.xml --limit 3
Expected output:
Fetching sitemap from https://yourblog.com/sitemap.xml...
Found 150 blog posts
Processing first 3 posts only (testing mode)
[1/3]
Processing: https://yourblog.com/blog-post/example-post
✓ Saved to: blog_posts/example-post.md
[2/3]
Processing: https://yourblog.com/blog-post/another-post
✓ Saved to: blog_posts/another-post.md
[3/3]
Processing: https://yourblog.com/blog-post/third-post
✓ Saved to: blog_posts/third-post.md
✓ Created index file: blog_posts/INDEX.md
============================================================
Conversion Complete!
============================================================
Total posts: 3
Successful: 3
Failed: 0
Time elapsed: 4.52 seconds
Output directory: /Users/you/blog_converter/blog_posts
Verify Output
Check the blog_posts/ folder. Open one .md file:
---
title: How to Calculate ESOP Taxation in India
slug: esop-taxation-india
url: https://yourblog.com/blog-post/esop-taxation-india
description: Complete guide to ESOP taxation in India...
author: Jane Doe
category: ESOP Management
published_date: November 15, 2024
updated_date: November 20, 2024
featured_image: https://yourblog.com/images/esop-tax.jpg
---
## Understanding ESOP Taxation
Employee Stock Option Plans (ESOPs) are taxed at two stages...
[rest of content]
Good signs:
- ✅ YAML frontmatter is properly formatted
- ✅ All metadata fields are populated
- ✅ Content is clean markdown (no HTML tags)
- ✅ Headings, lists, and links are preserved
Red flags:
- ❌ Empty content body
- ❌ HTML tags in content (<div>, <p>)
- ❌ Missing metadata
- ❌ Garbled characters
If you see red flags, revisit Step 4 and adjust your selectors.
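Checking files by eye works for three posts, but not for a hundred. The sketch below automates the red-flag list for one file's text. The tag list, the required keys, and the name `find_red_flags` are assumptions chosen for illustration, and the garbled-character check only looks for the Unicode replacement character.

```python
import re

# Tags that should not survive a clean conversion (an assumed, non-exhaustive list)
HTML_TAG = re.compile(r"</?(div|p|span|br|img|table)\b[^>]*>", re.IGNORECASE)
REQUIRED_KEYS = ("title", "author", "published_date")  # assumption: adjust to taste


def find_red_flags(markdown_text):
    """Return a list of problems found in one converted post."""
    if not markdown_text.startswith("---\n"):
        return ["missing YAML frontmatter"]
    # Skip the opening '---' line, then split at the closing one
    parts = markdown_text[4:].split("\n---\n", 1)
    if len(parts) < 2:
        return ["frontmatter is not closed"]
    frontmatter, body = parts
    problems = []
    for key in REQUIRED_KEYS:
        if not re.search(rf"^{key}:", frontmatter, re.MULTILINE):
            problems.append(f"missing metadata: {key}")
    if not body.strip():
        problems.append("empty content body")
    if HTML_TAG.search(body):
        problems.append("HTML tags in content")
    if "\ufffd" in markdown_text:
        problems.append("garbled characters")
    return problems
```

To use it, read each .md file in blog_posts/ as UTF-8 text and print any non-empty result next to the filename. Posts that legitimately show HTML inside code samples will be flagged too, so treat the output as a list of files to look at rather than a verdict.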
Run Full Conversion
Once you’re satisfied with the test:
python3 blog_to_markdown.py --sitemap https://yourblog.com/sitemap.xml
For 100 posts, expect 1-2 minutes runtime.
Step 6: Combine Posts for LLM Analysis
Create combine_posts.py:
#!/usr/bin/env python3
"""
Smart Blog Post Combiner
Combines all markdown files into one organized document with TOC
"""
from pathlib import Path
import yaml
import re
from datetime import datetime


def extract_frontmatter(content):
    """Extract YAML frontmatter from markdown"""
    if content.startswith('---'):
        parts = content.split('---', 2)
        if len(parts) >= 3:
            try:
                # "or {}" covers an empty frontmatter block, which loads as None
                frontmatter = yaml.safe_load(parts[1]) or {}
                body = parts[2].strip()
                return frontmatter, body
            except yaml.YAMLError:
                return {}, content
    return {}, content


def sanitize_title(title):
    """Clean title for TOC links"""
    sanitized = title.lower()
    sanitized = re.sub(r'[^\w\s-]', '', sanitized)
    sanitized = re.sub(r'[-\s]+', '-', sanitized)
    return sanitized


def combine_posts(input_dir='blog_posts', output_file='all_posts_combined.md'):
    """Combine all posts into one smart document"""
    input_path = Path(input_dir)
    # Get all markdown files except INDEX.md
    md_files = [f for f in input_path.glob('*.md') if f.name != 'INDEX.md']
    if not md_files:
        print(f"No markdown files found in {input_dir}/")
        return
    print(f"Found {len(md_files)} blog posts")
    print("Reading and organizing posts...")
    # Read all posts and extract metadata
    posts = []
    for md_file in md_files:
        content = md_file.read_text(encoding='utf-8')
        metadata, body = extract_frontmatter(content)
        posts.append({
            'filename': md_file.name,
            'metadata': metadata,
            'body': body,
            # str() guards against titles that YAML loads as numbers
            'title': str(metadata.get('title', md_file.stem))
        })

    # Sort posts by published date (most recent first)
    def get_date(post):
        # str() also covers ISO dates that YAML has already parsed into date objects
        date_str = str(post['metadata'].get('published_date', ''))
        for fmt in ['%B %d, %Y', '%b %d, %Y', '%Y-%m-%d']:
            try:
                return datetime.strptime(date_str, fmt)
            except ValueError:
                continue
        return datetime.min  # missing or unparseable dates sort last

    posts.sort(key=get_date, reverse=True)
    print("Creating combined document...")
    # Create the combined document
    output = []
    # Header
    output.append("# Complete Blog Collection\n\n")
    # Two trailing spaces plus a newline give a Markdown line break
    output.append(f"**Total Posts:** {len(posts)}  \n")
    # The blank line keeps the '---' below from turning this line into a heading
    output.append(f"**Generated:** {datetime.now().strftime('%B %d, %Y at %H:%M:%S')}\n\n")
    output.append("---\n")
    # Table of Contents
    output.append("## 📑 Table of Contents\n")
    # Group by category
    by_category = {}
    no_category = []
    for i, post in enumerate(posts, 1):
        category = post['metadata'].get('category', '')
        if category:
            if category not in by_category:
                by_category[category] = []
            by_category[category].append((i, post))
        else:
            no_category.append((i, post))
    # Write TOC by category.
    # Each anchor is built from the full heading text ("Post #N: Title") so that
    # it matches the "## Post #N: Title" headings written further down.
    for category in sorted(by_category.keys()):
        output.append(f"\n### {category}\n")
        for idx, post in by_category[category]:
            title = post['title']
            anchor = sanitize_title(f"Post #{idx}: {title}")
            output.append(f"{idx}. [{title}](#{anchor})\n")
    if no_category:
        output.append("\n### Other Posts\n")
        for idx, post in no_category:
            title = post['title']
            anchor = sanitize_title(f"Post #{idx}: {title}")
            output.append(f"{idx}. [{title}](#{anchor})\n")
    output.append("\n---\n")
    output.append("\n# 📝 Blog Posts\n")
    # Add each post with clear separators
    for i, post in enumerate(posts, 1):
        metadata = post['metadata']
        title = post['title']
        # Post separator
        output.append(f"\n\n{'='*80}\n")
        output.append(f"## Post #{i}: {title}\n")
        output.append(f"{'='*80}\n\n")
        # Metadata box
        output.append("**Metadata:**\n")
        output.append("```yaml\n")
        meta_items = []
        if metadata.get('slug'):
            meta_items.append(f"Slug: {metadata['slug']}")
        if metadata.get('author'):
            meta_items.append(f"Author: {metadata['author']}")
        if metadata.get('category'):
            meta_items.append(f"Category: {metadata['category']}")
        if metadata.get('published_date'):
            meta_items.append(f"Published: {metadata['published_date']}")
        if metadata.get('updated_date'):
            meta_items.append(f"Updated: {metadata['updated_date']}")
        if metadata.get('url'):
            meta_items.append(f"URL: {metadata['url']}")
        output.append('\n'.join(meta_items))
        output.append("\n```\n\n")
        if metadata.get('description'):
            output.append(f"**Summary:** {metadata['description']}\n\n")
        output.append("---\n\n")
        # Post content
        output.append(post['body'])
        output.append("\n\n")
    # Footer
    output.append("\n\n")
    output.append("="*80 + "\n")
    output.append(f"**End of Collection** - {len(posts)} posts total\n")
    output.append("="*80 + "\n")
    # Write to file
    output_path = Path(output_file)
    output_path.write_text(''.join(output), encoding='utf-8')
    size_mb = output_path.stat().st_size / (1024 * 1024)
    print(f"\n✓ Successfully combined {len(posts)} posts!")
    print(f"✓ Output file: {output_path.absolute()}")
    print(f"✓ File size: {size_mb:.2f} MB")


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Combine blog posts')
    parser.add_argument('--input', default='blog_posts',
                        help='Input directory')
    parser.add_argument('--output', default='all_posts_combined.md',
                        help='Output filename')
    args = parser.parse_args()
    combine_posts(args.input, args.output)
Run the Combiner
python3 combine_posts.py
This creates all_posts_combined.md with:
- Table of contents organized by category
- Clear post separators
- Metadata boxes for each post
- Chronologically sorted (newest first)
Advanced: Scheduling Automated Updates
[Author’s note: This particular section is untested as of now.]
Want to keep your markdown files in sync as you publish new posts? Set up a cron job or GitHub Action.
Cron Job (Mac/Linux)
# Edit crontab
crontab -e
# Add line to run weekly on Sundays at 2 AM
0 2 * * 0 cd /path/to/blog_converter && python3 blog_to_markdown.py --sitemap https://yourblog.com/sitemap.xml
GitHub Action
Create .github/workflows/blog-sync.yml:
name: Sync Blog Posts
on:
  schedule:
    - cron: '0 2 * * 0' # Weekly on Sundays
  workflow_dispatch: # Manual trigger
permissions:
  contents: write # Lets the final "git push" step write to the repository
jobs:
  sync:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run converter
        run: python blog_to_markdown.py --sitemap https://yourblog.com/sitemap.xml
      - name: Commit changes
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add blog_posts/
          git commit -m "Auto-sync blog posts" || echo "No changes"
          git push
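Both schedules above re-download every post on each run. If you only want new posts, one possible approach is to skip URLs that already have a markdown file. The sketch below is untested against a live blog; `pending_urls` is a hypothetical helper, and it assumes the '/blog-post/' marker and the filename rule from save_markdown() in the converter script.

```python
import re
from pathlib import Path


def pending_urls(urls, output_dir, marker="/blog-post/"):
    """Return the URLs that have no matching .md file in output_dir yet."""
    output_dir = Path(output_dir)
    todo = []
    for url in urls:
        slug = url.split(marker)[-1].strip("/")
        # Same filename rule as save_markdown() in the converter script
        filename = re.sub(r"[^\w\-]", "_", slug) + ".md"
        if not (output_dir / filename).exists():
            todo.append(url)
    return todo
```

The trade-off is that edits to already-converted posts are not picked up, so an occasional full run is still worthwhile.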
Troubleshooting
“No content found” errors
Problem: Content div selector is wrong
Solution:
- Inspect your blog post HTML
- Find the main content container
- Update the extract_content() method with the correct selector
“Command not found: python”
Problem: Wrong Python command on Mac
Solution: Use python3 instead of python
“403 Forbidden” errors
Problem: Site blocking the scraper
Solution:
- Add custom User-Agent (already in script)
- Check robots.txt for scraping rules
- Contact your hosting provider if you own the site
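For the robots.txt check, Python's standard library can evaluate the rules for you. This is a minimal sketch that works on robots.txt text you have already downloaded. `allowed_by_robots` is a hypothetical helper, and the "BlogConverter" user agent is an assumption based on the User-Agent string in the script.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="BlogConverter"):
    """Check a URL against robots.txt rules that were already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder rules for illustration
robots = "User-agent: *\nDisallow: /admin/\n"
print(allowed_by_robots(robots, "https://yourblog.com/blog-post/example-post"))
print(allowed_by_robots(robots, "https://yourblog.com/admin/settings"))
```

A 403 can still occur when robots.txt allows the URL, for example when a firewall blocks automated clients, so this only rules out one cause.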
Metadata fields are empty
Problem: Selectors don’t match your HTML structure
Solution: Update all selectors in extract_metadata() to match your blog’s structure
Best Practices
1. Test Before Full Run
Always use --limit 3 first to verify selectors work correctly.
2. Respect Rate Limits
The script includes 0.5s delays between requests. Don’t remove these—be a good internet citizen.
3. Version Control Your Output
git init
git add blog_posts/
git commit -m "Initial blog export"
This lets you track content changes over time.
4. Document Your Selectors
Keep a comment at the top of your script:
"""
BLOG-SPECIFIC SELECTORS (Updated: 2024-11-26)
- Content: div.text-rich-text
- Title: h1.heading-style-h2
- Author: p.text-size-regular.text-color-brandnile
- Date: p.text-size-regular.text-color-grey
"""
5. Regular Audits
Run monthly to catch:
- New posts to analyze
- Changed HTML structure (update selectors)
- Broken image links
- Outdated content
Extending the Script
Add Reading Time Calculation
def calculate_reading_time(content):
    """Calculate estimated reading time"""
    words = len(content.split())
    # Assumes ~200 words per minute; max() avoids "0 min read" for short posts
    minutes = max(1, round(words / 200))
    return f"{minutes} min read"
Extract Internal Links
def extract_internal_links(soup, base_url):
    """Find all internal links for link graph analysis"""
    links = []
    for a in soup.find_all('a', href=True):
        href = a['href']
        if base_url in href:
            links.append({
                'text': a.get_text(strip=True),
                'url': href
            })
    return links
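One limitation of the substring test above is that relative links such as /blog-post/other never contain base_url, so they are missed. A possible refinement, sketched here with the standard library, resolves each href against the page URL and compares hosts. `resolve_internal` is a hypothetical helper and the URLs are placeholders.

```python
from urllib.parse import urljoin, urlparse


def resolve_internal(href, page_url):
    """Return the absolute URL if href stays on page_url's host, else None."""
    absolute = urljoin(page_url, href)
    same_host = urlparse(absolute).netloc == urlparse(page_url).netloc
    return absolute if same_host else None


page = "https://yourblog.com/blog-post/a"
print(resolve_internal("/blog-post/other", page))
print(resolve_internal("https://example.org/x", page))
```

Inside extract_internal_links(), this could replace the `if base_url in href` test, keeping a link only when the result is not None.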
Generate Content Calendar
def analyze_publishing_frequency(posts):
    """Analyze posting patterns"""
    dates = [p['metadata'].get('published_date') for p in posts]
    # Parse dates and calculate frequency
    # Suggest optimal posting schedule
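The stub above leaves the parsing step open. One way to fill it in is to count posts per month, sketched below. `posts_per_month` is a hypothetical helper, and the default format assumes dates written like "November 15, 2024", as in the sample frontmatter earlier.

```python
from collections import Counter
from datetime import datetime


def posts_per_month(date_strings, fmt="%B %d, %Y"):
    """Count posts per 'YYYY-MM' month, skipping dates that do not parse."""
    counts = Counter()
    for text in date_strings:
        try:
            parsed = datetime.strptime(text or "", fmt)
        except ValueError:
            continue  # empty or differently formatted date
        counts[parsed.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


print(posts_per_month(["November 15, 2024", "November 20, 2024", "October 1, 2024"]))
```

The `dates` list built in analyze_publishing_frequency() can be passed in directly. Months with no posts simply do not appear in the result, which makes gaps in the schedule easy to spot.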
Conclusion
Converting your blog to markdown unlocks powerful AI-driven content analysis. What used to take days of manual auditing now takes minutes.
The key is getting clean, structured data that LLMs can parse effectively. With proper metadata extraction and YAML frontmatter, you can:
- Identify content gaps at scale
- Optimize SEO systematically
- Plan content calendars from data
- Maintain content quality consistently
- Scale content operations efficiently
Time investment: 30 minutes setup.