How to Get off WordPress

3. Data Extraction from WordPress

Export posts, metadata, taxonomies, authors, and media, and map each one deterministically to the destination system.

3.1 Exporting content

Start with the WordPress REST API for standard entities. Use direct database access when API coverage is incomplete.

Export targets

Posts and pages
Custom post types
Categories and tags
Custom taxonomies
Authors and publish states
Slugs and permalinks
SEO plugin metadata (title, description, canonical overrides, social images) when applicable

REST API baseline

curl "https://example.com/wp-json/wp/v2/posts?per_page=100&page=1"

Practical REST constraints (important)

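per_page is capped (100 items per request in core), so you must paginate until you exhaust results. The X-WP-TotalPages response header reports how many pages exist.
Drafts, private posts, and some fields require authentication and the correct request context (for example, context=edit).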
Menus, redirects, SEO metadata, and builder-specific data are often not available in core REST endpoints without extra work.
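The pagination constraint can be sketched as a small loop. This is a minimal illustration, not a complete exporter: the function names fetch_page and iter_all are hypothetical, the request is unauthenticated (so it only sees published posts), and it assumes the site exposes the standard X-WP-TotalPages header.

```python
import json
import urllib.request
from typing import Callable, Iterator

Page = tuple[list[dict], int]  # (items on this page, total page count)


def fetch_page(base_url: str, page: int, per_page: int = 100) -> Page:
    """Fetch one page of posts and read the total page count from the headers."""
    url = f"{base_url}/wp-json/wp/v2/posts?per_page={per_page}&page={page}"
    with urllib.request.urlopen(url) as resp:
        total_pages = int(resp.headers.get("X-WP-TotalPages", "1"))
        return json.load(resp), total_pages


def iter_all(get_page: Callable[[int], Page]) -> Iterator[dict]:
    """Yield every item by walking pages until the reported total is reached."""
    page = 1
    while True:
        items, total_pages = get_page(page)
        yield from items
        # Stop on the last reported page, or early if a page comes back empty.
        if page >= total_pages or not items:
            break
        page += 1
```

A call such as list(iter_all(lambda p: fetch_page("https://example.com", p))) would collect all published posts. Keeping the page fetcher injectable makes the loop testable without network access.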

Database fallback extraction

Use SQL for complex metadata joins, legacy builders, and unpublished states.

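-- Assumes the default wp_ table prefix; 'case_study' stands in for your own custom post types.
-- Returns one row per (post, meta key) pair; posts without meta still appear once with NULL meta columns.
SELECT p.ID, p.post_type, p.post_status, p.post_name, p.post_title, p.post_date,
       pm.meta_key, pm.meta_value
FROM wp_posts p
LEFT JOIN wp_postmeta pm ON pm.post_id = p.ID
WHERE p.post_type IN ('post', 'page', 'case_study')
ORDER BY p.ID, pm.meta_id;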
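Because the join returns one row per post and meta key, the rows usually need to be folded back into one record per post before they are useful. The sketch below assumes rows arrive as 8-tuples in the column order of the query above; the function name fold_rows and the record shape are illustrative choices, not a fixed format.

```python
def fold_rows(rows):
    """Group (post, meta) join rows into one record per post ID."""
    posts = {}
    for post_id, post_type, status, slug, title, date, meta_key, meta_value in rows:
        post = posts.setdefault(post_id, {
            "id": post_id,
            "type": post_type,
            "status": status,
            "slug": slug,
            "title": title,
            "date": date,
            "meta": {},
        })
        # LEFT JOIN yields NULL (None) meta columns for posts without meta.
        if meta_key is not None:
            # A post can hold several values under the same meta key.
            post["meta"].setdefault(meta_key, []).append(meta_value)
    return posts
```

Storing meta values as lists avoids silently dropping repeated keys, which wp_postmeta permits.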

Do not miss these high-impact datasets

Most migration pain comes from forgetting one of these categories:

SEO plugin metadata in wp_postmeta (title, description, canonical overrides, social images)
Redirect rules stored in plugins or server config
Navigation menus and menu item hierarchies
Custom fields (ACF and similar) that hold core page content
Gutenberg reusable blocks (if used) and any shared patterns
Builder content stored in post meta instead of post_content
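As one illustration of pulling a dataset from this list out of post meta, the sketch below maps SEO plugin keys to neutral field names. It assumes the site uses Yoast SEO and that the keys shown are the ones present; verify them against the actual wp_postmeta rows, since other plugins use different keys. The destination field names and the meta shape (a dict of meta_key to a list of values for one post) are hypothetical.

```python
# Assumed Yoast SEO meta keys; confirm against the real database before relying on them.
SEO_KEY_MAP = {
    "_yoast_wpseo_title": "seo_title",
    "_yoast_wpseo_metadesc": "seo_description",
    "_yoast_wpseo_canonical": "canonical_url",
    "_yoast_wpseo_opengraph-image": "social_image",
}


def extract_seo(meta: dict[str, list[str]]) -> dict[str, str]:
    """Return the SEO fields present in one post's meta, skipping empty values."""
    seo = {}
    for wp_key, field in SEO_KEY_MAP.items():
        values = meta.get(wp_key) or []
        if values and values[0]:
            seo[field] = values[0]
    return seo
```

Title values may contain plugin template variables rather than final text, so they can need a resolution step before import.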

3.2 Media extraction

Media usually lives in wp-content/uploads, in a date-based (year/month) folder structure.

Goals

Preserve file names and relative paths when possible
Preserve alt text, captions, and featured image associations
Generate a deterministic mapping from old URLs to new CDN URLs

Mapping approach

Copy media to staging storage.
Upload to B2 with a normalized key strategy.
Record the mapping in a CSV or table: old_url,new_url,file_hash.

old_url,new_url,file_hash
https://site.com/wp-content/uploads/2021/09/hero.jpg,https://cdn.site.com/media/2021/09/hero.jpg,abc123
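A minimal sketch of generating such rows follows. It assumes the key strategy is simply "keep the path after wp-content/uploads/", uses SHA-256 for file_hash, and takes the CDN base from the example row; the function names are hypothetical and the upload step itself is left out.

```python
import csv
import hashlib
import io
from urllib.parse import urlsplit

UPLOADS_PREFIX = "/wp-content/uploads/"


def map_media(old_url: str, content: bytes,
              cdn_base: str = "https://cdn.site.com/media") -> dict:
    """Derive the new URL from the uploads-relative path and hash the file bytes."""
    path = urlsplit(old_url).path
    if UPLOADS_PREFIX not in path:
        raise ValueError(f"not an uploads URL: {old_url}")
    relative = path.split(UPLOADS_PREFIX, 1)[1]
    return {
        "old_url": old_url,
        "new_url": f"{cdn_base.rstrip('/')}/{relative}",
        "file_hash": hashlib.sha256(content).hexdigest(),
    }


def mapping_csv(rows: list[dict]) -> str:
    """Serialize mapping rows to CSV text with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["old_url", "new_url", "file_hash"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Because the new URL depends only on the old path, rerunning the mapping yields the same result. Resized variants that WordPress generates (for example hero-300x200.jpg) need their own rows if content references them.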

3.3 Content normalization

Raw WordPress content often contains legacy artifacts.

Common cleanup tasks

Remove unsupported shortcodes
Convert embeds to explicit components
Strip page builder wrappers and inline style noise
Normalize heading hierarchy
Normalize internal links and image references to the new URL scheme
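Two of these tasks, shortcode removal and URL rewriting, can be sketched with plain string transforms. This is an illustrative approach under simple assumptions: shortcode tags are removed while their inner content is kept, and URLs are replaced by exact string match against an old-to-new mapping. The function names are hypothetical, and regex-based cleanup will not cover every shortcode syntax.

```python
import re


def strip_shortcodes(html: str, names: set[str]) -> str:
    """Remove [name ...] and [/name] tags for the given shortcodes, keeping inner content."""
    if not names:
        return html
    alternation = "|".join(re.escape(name) for name in sorted(names))
    # Match an opening, closing, or self-closing tag for exactly these names.
    pattern = re.compile(rf"\[/?(?:{alternation})(?:\s[^\]]*)?/?\]")
    return pattern.sub("", html)


def rewrite_urls(html: str, mapping: dict[str, str]) -> str:
    """Replace each old URL with its new URL, longest old URL first."""
    for old in sorted(mapping, key=len, reverse=True):
        html = html.replace(old, mapping[old])
    return html
```

Replacing the longest URLs first avoids a shorter old URL clobbering part of a longer one that shares its prefix.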

Gutenberg block handling (common case)

If the site uses Gutenberg heavily, treat content normalization as block transformation, not raw HTML cleanup.

What to watch for:

Blocks that map cleanly to structured fields (headings, paragraphs, lists, images)
Blocks that should become explicit components (callouts, tabs, accordions, galleries)
Custom blocks provided by plugins or themes (these are the risky ones)

A practical approach is:

Parse blocks from WordPress.
Convert the block tree into your destination model (Payload blocks or MDX).
Render and diff a sample set of pages for editorial review.
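The first two steps can be sketched as below. This is a deliberately simplified parser for the comment delimiters Gutenberg writes into post_content; it does not handle a block nested inside another block of the same name, and it does not descend into inner blocks. For production use, WordPress's own parse_blocks() function or the official JavaScript block parser is a safer source of the block tree. The destination shapes (heading, richText, needsReview) are hypothetical.

```python
import json
import re

BLOCK_RE = re.compile(
    r"<!--\s+wp:(?P<name>[a-z][a-z0-9_-]*(?:/[a-z][a-z0-9_-]*)?)"
    r"(?:\s+(?P<attrs>\{.*?\}))?\s+"
    r"(?:/-->|-->(?P<inner>.*?)<!--\s+/wp:(?P=name)\s+-->)",
    re.DOTALL,
)


def parse_blocks(content: str) -> list[dict]:
    """Extract top-level blocks as dicts with name, attrs, and inner HTML."""
    blocks = []
    for match in BLOCK_RE.finditer(content):
        name = match.group("name")
        blocks.append({
            # Core blocks are serialized without the "core/" namespace.
            "name": name if "/" in name else f"core/{name}",
            "attrs": json.loads(match.group("attrs") or "{}"),
            "inner_html": (match.group("inner") or "").strip(),
        })
    return blocks


def to_destination(block: dict) -> dict:
    """Map a parsed block to a destination shape; flag anything unrecognized."""
    if block["name"] == "core/heading":
        text = re.sub(r"<[^>]+>", "", block["inner_html"])
        return {"type": "heading", "level": block["attrs"].get("level", 2), "text": text}
    if block["name"] == "core/paragraph":
        return {"type": "richText", "html": block["inner_html"]}
    # Custom or unmapped blocks go to editorial review instead of being guessed at.
    return {"type": "needsReview", "source": block}
```

Routing unknown blocks to an explicit needsReview outcome keeps plugin and theme blocks from being silently dropped.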

Transformation outcomes

Markdown or MDX for mostly narrative content
Structured CMS fields for reusable content blocks

Safe transformation workflow

Export raw HTML to intermediate JSON.
Run deterministic transforms with tests.
Save both original and transformed output for auditability.
Add a manual review queue for outliers.
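One way to sketch this workflow is an audit record per post that keeps the original, the transformed output, and hashes of both. The record fields, the function names, and the outlier rule (flag a post when the transformed output shrinks below half the original length) are all assumptions for illustration; pick thresholds that match your content.

```python
import hashlib
import json
from typing import Callable, Iterable

REVIEW_SHRINK_RATIO = 0.5  # hypothetical threshold for flagging outliers


def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_record(post_id: int, raw_html: str,
                 transforms: Iterable[Callable[[str], str]]) -> dict:
    """Apply transforms in order and keep both versions for auditing."""
    transformed = raw_html
    for transform in transforms:
        transformed = transform(transformed)
    return {
        "id": post_id,
        "original": raw_html,
        "original_sha256": sha256(raw_html),
        "transformed": transformed,
        "transformed_sha256": sha256(transformed),
        # Large shrinkage suggests a transform removed real content.
        "needs_review": len(transformed) < REVIEW_SHRINK_RATIO * len(raw_html),
    }


def to_json_line(record: dict) -> str:
    """Serialize with sorted keys so repeated runs produce identical output."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)
```

If the transforms are pure functions, rerunning the pipeline on the same export gives byte-identical records, which makes diffs between runs meaningful.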

Validation checks

Slug uniqueness in destination system
Word count variance per page stays within an expected tolerance
Link integrity for internal and external links
Featured images and embeds render correctly
Redirect coverage for every indexed URL you care about
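Several of these checks reduce to small functions that can run in CI after each migration pass. The sketch below covers slug uniqueness, word count drift, and redirect coverage; the function names are hypothetical, the word count is a rough tag-stripping estimate, and URLs are compared as exact strings.

```python
import re
from collections import Counter


def duplicate_slugs(slugs: list[str]) -> list[str]:
    """Return slugs that appear more than once."""
    return sorted(slug for slug, count in Counter(slugs).items() if count > 1)


def word_count(html: str) -> int:
    """Count whitespace-separated words after replacing tags with spaces."""
    return len(re.sub(r"<[^>]+>", " ", html).split())


def word_count_drift(original_html: str, transformed_html: str) -> float:
    """Relative change in word count between original and transformed content."""
    before = word_count(original_html)
    after = word_count(transformed_html)
    return abs(after - before) / max(before, 1)


def uncovered_urls(indexed_urls: list[str], new_urls: set[str],
                   redirects: dict[str, str]) -> list[str]:
    """Return indexed URLs that neither exist in the new site nor have a redirect."""
    return sorted(
        url for url in indexed_urls
        if url not in new_urls and url not in redirects
    )
```

A non-empty result from duplicate_slugs or uncovered_urls is a natural hard failure, while word_count_drift is better compared against a per-site tolerance.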