How to Get off WordPress
3. Data Extraction from WordPress
Export posts, metadata, taxonomies, authors, and media with deterministic mapping.
3.1 Exporting content
Start with the WordPress REST API for standard entities. Fall back to direct database access when API coverage is incomplete.
Export targets
- Posts and pages
- Custom post types
- Categories and tags
- Custom taxonomies
- Authors and publish states
- Slugs and permalinks
- SEO plugin metadata (title, description, canonical overrides, social images) when applicable
REST API baseline
curl "https://example.com/wp-json/wp/v2/posts?per_page=100&page=1"
Practical REST constraints (important)
- per_page is capped, so you must paginate until you exhaust results.
- Drafts, private posts, and some fields require authentication and the correct request context.
- Menus, redirects, SEO metadata, and builder-specific data are often not available in core REST endpoints without extra work.
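The pagination constraint can be sketched as follows. This is a minimal, unauthenticated example rather than a complete exporter: the function names `http_fetch` and `iter_posts` are made up for illustration, and it relies on the `X-WP-TotalPages` response header that core collection endpoints return. Drafts and private posts would additionally need credentials, which are not shown here.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterator, List, Tuple

# A fetcher returns (decoded JSON items, total page count) for one URL.
FetchPage = Callable[[str], Tuple[List[Dict], int]]


def http_fetch(url: str) -> Tuple[List[Dict], int]:
    """Fetch one page of results over HTTP (no authentication)."""
    with urllib.request.urlopen(url) as resp:
        # If the header is missing we assume a single page and stop early.
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def iter_posts(base_url: str, fetch: FetchPage = http_fetch,
               per_page: int = 100) -> Iterator[Dict]:
    """Yield every post by walking pages until the reported last page."""
    page = 1
    while True:
        url = f"{base_url}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
        items, total_pages = fetch(url)
        yield from items
        # Stop on the last page, or defensively on an empty page.
        if page >= total_pages or not items:
            break
        page += 1
```

Passing the fetcher as a parameter keeps the loop testable offline; in real use, `for post in iter_posts("https://example.com"): ...` walks the whole collection.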
Database fallback extraction
Use SQL for complex metadata joins, legacy builders, and unpublished states.
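-- Returns one row per (post, meta entry); posts without meta still appear via the LEFT JOIN.
-- wp_ is the default table prefix; adjust it and the post type list to match your site.
SELECT p.ID, p.post_type, p.post_status, p.post_name, p.post_title, p.post_date,
       pm.meta_key, pm.meta_value
FROM wp_posts p
LEFT JOIN wp_postmeta pm ON pm.post_id = p.ID
WHERE p.post_type IN ('post', 'page', 'case_study');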
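Because the join returns one row per meta entry, each post shows up many times in the result. A minimal sketch of folding those rows back into one record per post might look like this. It assumes the rows arrive as dictionaries keyed by the column names in the query above; `fold_meta_rows` is an illustrative name, not part of any library.

```python
from typing import Any, Dict, Iterable, List

POST_COLUMNS = ("post_type", "post_status", "post_name", "post_title", "post_date")


def fold_meta_rows(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse (post, meta) join rows into one dict per post."""
    posts: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        post = posts.get(row["ID"])
        if post is None:
            post = {"ID": row["ID"], "meta": {}}
            post.update({col: row[col] for col in POST_COLUMNS})
            posts[row["ID"]] = post
        # The LEFT JOIN yields a NULL meta_key for posts with no meta at all.
        if row["meta_key"] is not None:
            # A meta key can repeat for one post, so values are kept as lists.
            post["meta"].setdefault(row["meta_key"], []).append(row["meta_value"])
    return list(posts.values())
```

Keeping every meta value as a list avoids silently dropping repeated keys, at the cost of an extra unwrapping step for single-valued fields.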
Do not miss these high-impact datasets
Most migration pain comes from forgetting one of these categories:
- SEO plugin metadata in wp_postmeta (title, description, canonical overrides, social images)
- Redirect rules stored in plugins or server config
- Navigation menus and menu item hierarchies
- Custom fields (ACF and similar) that hold core page content
- Gutenberg reusable blocks (if used) and any shared patterns
- Builder content stored in post meta instead of post_content
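One way to avoid missing these datasets is to inventory the meta keys actually present before deciding what to migrate. The sketch below groups keys by prefix. The prefix table is an assumption about a few common plugins (Yoast SEO, Rank Math, Elementor) and should be checked against the keys in your own wp_postmeta; `inventory_meta_keys` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# Illustrative prefixes only; confirm them against your own installation.
KNOWN_PREFIXES: Mapping[str, str] = {
    "_yoast_wpseo_": "seo",
    "rank_math_": "seo",
    "_elementor_": "builder",
}


def inventory_meta_keys(meta_keys: Iterable[str],
                        prefixes: Mapping[str, str] = KNOWN_PREFIXES
                        ) -> Dict[str, Counter]:
    """Count meta keys per category; unknown keys land in 'unclassified'."""
    buckets: Dict[str, Counter] = {}
    for key in meta_keys:
        category = next(
            (cat for prefix, cat in prefixes.items() if key.startswith(prefix)),
            "unclassified",
        )
        buckets.setdefault(category, Counter())[key] += 1
    return buckets
```

The "unclassified" bucket is the one worth reading by hand: custom fields that hold core page content usually end up there.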
3.2 Media extraction
Media usually lives in wp-content/uploads, organized into year/month folders (for example 2021/09/).
Goals
- Preserve file names and relative paths when possible
- Preserve alt text, captions, and featured image associations
- Generate deterministic mapping from old URLs to new CDN URLs
Mapping approach
- Copy media to staging storage.
- Upload to B2 with a normalized key strategy.
- Record the mapping in a CSV or table with the columns old_url, new_url, file_hash.
old_url,new_url,file_hash
https://site.com/wp-content/uploads/2021/09/hero.jpg,https://cdn.site.com/media/2021/09/hero.jpg,abc123
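A mapping row like the one above could be produced along these lines. This is a sketch under assumptions: the CDN base URL is supplied by the caller, SHA-256 is one reasonable choice for file_hash (the guide does not prescribe an algorithm), and the helper names are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Dict
from urllib.parse import urlsplit

UPLOADS_MARKER = "/wp-content/uploads/"


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def new_url_for(old_url: str, cdn_base: str) -> str:
    """Keep the path relative to the uploads folder and re-root it on the CDN."""
    path = urlsplit(old_url).path
    if UPLOADS_MARKER not in path:
        raise ValueError(f"not an uploads URL: {old_url}")
    relative = path.split(UPLOADS_MARKER, 1)[1]
    return f"{cdn_base.rstrip('/')}/{relative}"


def mapping_row(old_url: str, local_file: Path, cdn_base: str) -> Dict[str, str]:
    """Build one old_url,new_url,file_hash record."""
    return {
        "old_url": old_url,
        "new_url": new_url_for(old_url, cdn_base),
        "file_hash": file_hash(local_file),
    }
```

Because the new URL depends only on the old path, rerunning the export yields the same mapping, and the hash lets you verify that the uploaded object matches the source file.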
3.3 Content normalization
Raw WordPress content often contains legacy artifacts.
Common cleanup tasks
- Remove unsupported shortcodes
- Convert embeds to explicit components
- Strip page builder wrappers and inline style noise
- Normalize heading hierarchy
- Normalize internal links and image references to the new URL scheme
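The last task, rewriting links and image references, can lean on the media mapping from section 3.2. The following is a simplified sketch: it only handles double-quoted src and href attributes (not srcset or single-quoted values), and `rewrite_urls` is an illustrative name. URLs under the old uploads prefix that have no mapping are collected for review instead of being guessed.

```python
import re
from typing import Dict, List, Tuple

# Matches src="..." and href="..." with double-quoted values only.
ATTR_URL = re.compile(r'(?P<attr>\b(?:src|href))="(?P<url>[^"]*)"')


def rewrite_urls(html: str, url_map: Dict[str, str],
                 old_prefix: str) -> Tuple[str, List[str]]:
    """Swap mapped URLs and report old-prefix URLs that have no mapping."""
    unmapped: List[str] = []

    def swap(match: "re.Match[str]") -> str:
        url = match["url"]
        if url in url_map:
            return f'{match["attr"]}="{url_map[url]}"'
        if url.startswith(old_prefix):
            unmapped.append(url)
        return match.group(0)  # leave everything else untouched

    return ATTR_URL.sub(swap, html), unmapped
```

Exact-match lookup is deliberate: it avoids partial replacements inside longer URLs, such as resized image variants that share a file name stem.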
Gutenberg block handling (common case)
If the site uses Gutenberg heavily, treat content normalization as block transformation, not raw HTML cleanup.
What to watch for:
- Blocks that map cleanly to structured fields (headings, paragraphs, lists, images)
- Blocks that should become explicit components (callouts, tabs, accordions, galleries)
- Custom blocks provided by plugins or themes (these are the risky ones)
A practical approach is:
- Parse blocks from WordPress.
- Convert the block tree into your destination model (Payload blocks or MDX).
- Render and diff a sample set of pages for editorial review.
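The first step can be sketched outside WordPress as well. WordPress ships its own parser (the PHP function parse_blocks()), which is the authoritative option when you can run code inside the site. The Python below is a simplified stand-in that reads the block comment delimiters from raw post_content, and it does not aim to match core behavior in every edge case. The names `parse_blocks` and `count_block_names` here are local to this example.

```python
import json
import re
from collections import Counter
from typing import Any, Dict, Optional

# Opening:  <!-- wp:name {json} -->   closing: <!-- /wp:name -->
# Void:     <!-- wp:name {json} /-->
TOKEN = re.compile(
    r"<!--\s+(?P<closer>/)?wp:(?P<name>[a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)\s+"
    r"(?P<attrs>\{.*?\}\s+)?(?P<void>/)?-->",
    re.DOTALL,
)


def _node(name: str, attrs: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    return {"name": name, "attrs": attrs or {}, "html": "", "children": []}


def parse_blocks(content: str) -> Dict[str, Any]:
    """Return a root node whose children are the top-level blocks."""
    root = _node("root")
    stack = [root]
    pos = 0
    for m in TOKEN.finditer(content):
        # HTML between delimiters belongs to the innermost open block.
        stack[-1]["html"] += content[pos:m.start()]
        pos = m.end()
        # Core blocks omit their namespace in the serialized form.
        name = m["name"] if "/" in m["name"] else "core/" + m["name"]
        if m["closer"]:
            if len(stack) > 1 and stack[-1]["name"] == name:
                stack.pop()
            continue  # unmatched closers are ignored in this sketch
        block = _node(name, json.loads(m["attrs"]) if m["attrs"] else {})
        stack[-1]["children"].append(block)
        if not m["void"]:
            stack.append(block)
    stack[-1]["html"] += content[pos:]
    return root


def count_block_names(node: Dict[str, Any]) -> Counter:
    """Count block names across the whole tree to surface custom blocks."""
    counts: Counter = Counter()
    for child in node["children"]:
        counts[child["name"]] += 1
        counts.update(count_block_names(child))
    return counts
```

Running `count_block_names` over the full export gives a quick list of every block type in use; any name outside the core/ namespace is a candidate for the risky custom-block category.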
Transformation outcomes
- Markdown or MDX for mostly narrative content
- Structured CMS fields for reusable content blocks
Safe transformation workflow
- Export raw HTML to intermediate JSON.
- Run deterministic transforms with tests.
- Save both original and transformed output for auditability.
- Add a manual review queue for outliers.
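The workflow above might be wired together roughly like this. The two transforms and the shortcode pattern are illustrative assumptions, not a complete cleanup set. The shortcode check is intentionally loose, so it can flag ordinary bracketed text, which is acceptable for a review queue.

```python
import re
from typing import Any, Callable, Dict, Iterable

Transform = Callable[[str], str]

# Loose pattern for leftover shortcodes such as [gallery ids="1,2"] or [/caption].
SHORTCODE = re.compile(r"\[/?[A-Za-z][\w-]*(?:\s[^\]]*)?\]")


def drop_empty_paragraphs(html: str) -> str:
    return re.sub(r"<p>\s*</p>", "", html)


def strip_inline_styles(html: str) -> str:
    return re.sub(r'\sstyle="[^"]*"', "", html)


def transform_post(post: Dict[str, Any],
                   transforms: Iterable[Transform]) -> Dict[str, Any]:
    """Apply transforms in order, keeping the original next to the result."""
    html = post["html"]
    for transform in transforms:
        html = transform(html)
    leftovers = SHORTCODE.findall(html)
    return {
        "id": post["id"],
        "original_html": post["html"],
        "transformed_html": html,
        "needs_review": bool(leftovers),  # route outliers to manual review
        "review_reasons": leftovers,
    }
```

Each transform is a pure string-to-string function, so it can be unit tested on its own, and the returned record serializes directly to JSON for the audit trail.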
Validation checks
- Slug uniqueness in destination system
- Word count variance per page within an expected tolerance
- Link integrity for internal and external links
- Featured images and embeds render correctly
- Redirect coverage for every indexed URL you care about
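Three of these checks are easy to automate. The sketch below uses hypothetical helper names and a deliberately crude word count (tags stripped with a regular expression), which is good enough for spotting large drifts but not for exact comparisons.

```python
import re
from collections import Counter
from typing import Iterable, List, Set


def duplicate_slugs(slugs: Iterable[str]) -> List[str]:
    """Slugs that occur more than once in the destination system."""
    return sorted(s for s, n in Counter(slugs).items() if n > 1)


def word_count(html: str) -> int:
    """Rough word count after replacing tags with spaces."""
    return len(re.sub(r"<[^>]+>", " ", html).split())


def word_count_drift(original_html: str, transformed: str) -> float:
    """Relative change in word count between source and output."""
    before, after = word_count(original_html), word_count(transformed)
    if before == 0:
        return 0.0 if after == 0 else 1.0
    return abs(after - before) / before


def uncovered_urls(indexed: Iterable[str], live: Set[str],
                   redirects: Set[str]) -> List[str]:
    """Indexed URLs that neither exist on the new site nor have a redirect."""
    return sorted(u for u in indexed if u not in live and u not in redirects)
```

A per-page drift threshold is a judgment call; pages that exceed it are good candidates for the manual review queue rather than automatic failures.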