
Batch Convert Word to Markdown: Scripts for Bulk Document Conversion

How to batch convert multiple Word (.docx) files to Markdown at once. Covers Pandoc scripts for Windows, macOS, and Linux, plus Python automation for advanced use cases.

WordToMD Team

Converting a single Word document to Markdown takes seconds with WordToMD. But what if you have 50, 100, or 500 .docx files to convert? That’s where batch conversion scripts come in. This guide covers the fastest approaches for bulk Word to Markdown conversion.

When You Need Batch Conversion

Batch conversion makes sense if you have:

  • A legacy documentation library in Word format
  • Weekly reports that need to become Markdown pages
  • A content migration project (Word → static site, wiki, or knowledge base)
  • Multiple authors submitting Word docs that publish to Markdown

For single files, WordToMD remains the easiest option.

Pandoc: The Batch Conversion Backbone

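Pandoc is a free, open-source command-line converter that reads .docx and writes Markdown, which makes it well suited to scripted batch conversion. Install it once, then script it for any volume of files.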

Windows PowerShell Script

# convert-all.ps1
# Converts all .docx files in the current directory and its subdirectories to .md

$docxFiles = Get-ChildItem -Path "." -Filter "*.docx" -Recurse

foreach ($file in $docxFiles) {
    $outputPath = [System.IO.Path]::ChangeExtension($file.FullName, ".md")
    
    pandoc `
        $file.FullName `
        -t gfm `
        --wrap=none `
        --extract-media="./media" `
        -o $outputPath
    
    Write-Host "✓ $($file.Name) → $([System.IO.Path]::GetFileName($outputPath))"
}

Write-Host "Done. Converted $($docxFiles.Count) files."

Run it:

.\convert-all.ps1

Bash Script (macOS / Linux)

#!/bin/bash
# convert-all.sh
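# Converts all .docx files recursively into one flat output folder
# (files with the same name in different subfolders overwrite each other)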

OUTPUT_DIR="./markdown-output"
mkdir -p "$OUTPUT_DIR"

find . -type f -name "*.docx" | while IFS= read -r docx_file; do
    filename=$(basename "$docx_file" .docx)
    output="$OUTPUT_DIR/$filename.md"
    
    pandoc \
        "$docx_file" \
        -t gfm \
        --wrap=none \
        --extract-media="$OUTPUT_DIR/media" \
        -o "$output"
    
    echo "✓ $docx_file → $output"
done

echo "Done."

Run it:

chmod +x convert-all.sh
./convert-all.sh

Python Script for Advanced Use Cases

Calling Pandoc from Python through the subprocess module gives more control over error handling and post-processing:

#!/usr/bin/env python3
"""
batch_convert.py — Convert all .docx files to Markdown with frontmatter injection
"""

import subprocess
import sys
from datetime import datetime
from pathlib import Path
from typing import Optional

INPUT_DIR = Path("./word-docs")
OUTPUT_DIR = Path("./markdown-output")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def convert_file(docx_path: Path) -> Optional[Path]:
    """Convert a .docx file to GFM Markdown using Pandoc."""
    output_path = OUTPUT_DIR / docx_path.with_suffix(".md").name
    
    result = subprocess.run(
        [
            "pandoc",
            str(docx_path),
            "-t", "gfm",
            "--wrap=none",
            f"--extract-media={OUTPUT_DIR}/media",
            "-o", str(output_path)
        ],
        capture_output=True,
        text=True
    )
    
    if result.returncode != 0:
        print(f"  ERROR: {result.stderr}", file=sys.stderr)
        return None
    
    return output_path

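def add_frontmatter(md_path: Path, title: str) -> None:
    """Prepend YAML frontmatter to a Markdown file."""
    # Escape backslashes and double quotes so the title stays valid YAML.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    frontmatter = f"""---
title: "{safe_title}"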
date: {datetime.now().strftime('%Y-%m-%d')}
draft: false
---

"""
    content = md_path.read_text(encoding="utf-8")
    md_path.write_text(frontmatter + content, encoding="utf-8")

def main():
    docx_files = list(INPUT_DIR.glob("**/*.docx"))
    
    if not docx_files:
        print(f"No .docx files found in {INPUT_DIR}")
        return
    
    print(f"Converting {len(docx_files)} files...")
    
    success = 0
    for docx_path in docx_files:
        print(f"  Converting: {docx_path.name}")
        
        output_path = convert_file(docx_path)
        if output_path:
            # Use filename (without extension) as title, replace hyphens/underscores
            title = docx_path.stem.replace("-", " ").replace("_", " ").title()
            add_frontmatter(output_path, title)
            print(f"  ✓ → {output_path.name}")
            success += 1
        else:
            print(f"  ✗ Failed: {docx_path.name}")
    
    print(f"\nDone: {success}/{len(docx_files)} files converted.")

if __name__ == "__main__":
    main()

Run it:

python3 batch_convert.py

Adding Frontmatter Automatically

Most static site generators, such as Hugo and Jekyll, read page metadata from YAML frontmatter. The Python script above adds basic frontmatter with a title derived from the filename. For a more complete approach, extract the document title from the first H1 heading:

def extract_title(md_content: str) -> str:
    """Extract the first H1 heading as the title."""
    for line in md_content.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return "Untitled"

Then update the add_frontmatter call to use the extracted title.
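
As a rough sketch of that change, the snippet below reads the converted file, takes the first H1 as the title, and falls back to a filename-based title when no H1 exists. The function name add_frontmatter_from_heading and the fallback parameter are illustrative choices rather than part of the script above, and the frontmatter fields simply mirror the ones already used.

```python
from datetime import date
from pathlib import Path


def extract_title(md_content: str, fallback: str = "Untitled") -> str:
    """Return the text of the first H1 heading, or the fallback."""
    for line in md_content.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return fallback


def add_frontmatter_from_heading(md_path: Path) -> str:
    """Prepend frontmatter whose title comes from the file's first H1."""
    content = md_path.read_text(encoding="utf-8")
    # Hypothetical fallback: derive a title from the filename, as before.
    fallback = md_path.stem.replace("-", " ").replace("_", " ").title()
    title = extract_title(content, fallback)
    # Escape characters that would break a double-quoted YAML string.
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    frontmatter = (
        "---\n"
        f'title: "{safe_title}"\n'
        f"date: {date.today().isoformat()}\n"
        "draft: false\n"
        "---\n\n"
    )
    md_path.write_text(frontmatter + content, encoding="utf-8")
    return title
```

In main(), this would replace the filename-based title line and the add_frontmatter call with a single add_frontmatter_from_heading(output_path) call.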

Preserving Directory Structure

When converting a nested folder structure, maintain the hierarchy:

# Bash: preserve subdirectory structure
find ./word-docs -type f -name "*.docx" | while IFS= read -r docx; do
    # Calculate relative path
    rel_path="${docx#./word-docs/}"
    output_dir="./markdown-output/$(dirname "$rel_path")"
    mkdir -p "$output_dir"
    
    output="$output_dir/$(basename "$docx" .docx).md"
    pandoc "$docx" -t gfm --wrap=none -o "$output"
    echo "✓ $rel_path"
done
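
The same idea can be sketched in Python for anyone extending the Python script instead. This is a minimal version, assuming the ./word-docs and ./markdown-output folder names used elsewhere in this guide; output_path_for and convert_tree are hypothetical helper names. It requires Python 3.9 or later for the list[Path] annotation.

```python
import subprocess
from pathlib import Path


def output_path_for(docx_path: Path, input_root: Path, output_root: Path) -> Path:
    """Map word-docs/a/b/file.docx to markdown-output/a/b/file.md."""
    relative = docx_path.relative_to(input_root)
    return (output_root / relative).with_suffix(".md")


def convert_tree(input_root: Path, output_root: Path) -> list[Path]:
    """Convert every .docx under input_root, mirroring its folder layout."""
    written = []
    for docx_path in sorted(input_root.rglob("*.docx")):
        if docx_path.name.startswith("~$"):
            continue  # skip the temporary lock files Word creates for open documents
        target = output_path_for(docx_path, input_root, output_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(
            ["pandoc", str(docx_path), "-t", "gfm", "--wrap=none", "-o", str(target)],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            written.append(target)
        else:
            print(f"Failed: {docx_path}: {result.stderr.strip()}")
    return written
```

Calling convert_tree(Path("./word-docs"), Path("./markdown-output")) returns the list of Markdown files that were written, which makes it easy to report how many conversions succeeded.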

Image Handling in Batch Conversion

With --extract-media, Pandoc extracts all images to a folder. The challenge: multiple .docx files may produce files with identical names (e.g., image1.png), so a shared media folder lets later documents overwrite images from earlier ones. Use a per-document subdirectory:

for docx_file in *.docx; do
    slug="${docx_file%.docx}"
    mkdir -p "./media/$slug"
    
    pandoc "$docx_file" \
        -t gfm \
        --wrap=none \
        --extract-media="./media/$slug" \
        -o "$slug.md"
done
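
One thing to watch: Pandoc generally writes image links using the --extract-media path exactly as it was passed, so links can break when the Markdown file ends up in a different folder from the one the command ran in. A possible workaround, sketched below, is to run Pandoc with its working directory set to the output folder so that media paths and Markdown files share the same base. The helper names pandoc_command and convert_with_media are illustrative, not part of any library.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path: Path) -> list[str]:
    """Build a Pandoc command that keeps each document's images separate."""
    slug = docx_path.stem
    return [
        "pandoc",
        str(docx_path.resolve()),          # absolute, because cwd changes below
        "-t", "gfm",
        "--wrap=none",
        f"--extract-media=media/{slug}",   # relative to the output folder
        "-o", f"{slug}.md",                # also relative to the output folder
    ]


def convert_with_media(docx_path: Path, output_dir: Path) -> bool:
    """Run Pandoc from inside output_dir; return True on success."""
    output_dir.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        pandoc_command(docx_path),
        cwd=output_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"Failed: {docx_path.name}: {result.stderr.strip()}")
    return result.returncode == 0
```

Because the input path is resolved to an absolute path before the working directory changes, the source files can live anywhere relative to the output folder.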

Performance for Large Sets

For large document sets (100+ files):

  • The scripts above run Pandoc once per file, one file at a time. This is fine for up to ~100 files.
  • For 500+ files, use GNU Parallel for concurrent processing:

find . -name "*.docx" | \
parallel pandoc {} -t gfm --wrap=none -o {.}.md

Install GNU Parallel with brew install parallel (macOS) or sudo apt install parallel (Debian/Ubuntu).
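
If installing GNU Parallel is not an option, a similar effect can be approximated from Python with a thread pool. Threads are enough here because the real work happens in separate Pandoc processes. This is only a sketch: run_pandoc and convert_concurrently are hypothetical names, and the default of four workers is an arbitrary starting point to tune for your machine.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def run_pandoc(docx_path: Path) -> bool:
    """Convert one file, writing the .md next to its source; True on success."""
    result = subprocess.run(
        ["pandoc", str(docx_path), "-t", "gfm", "--wrap=none",
         "-o", str(docx_path.with_suffix(".md"))],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def convert_concurrently(
    paths: Iterable[Path],
    convert: Callable[[Path], bool] = run_pandoc,
    max_workers: int = 4,
) -> dict[Path, bool]:
    """Run convert() on each path using a pool of worker threads."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(convert, paths))
    return dict(zip(paths, results))
```

For example, convert_concurrently(Path(".").rglob("*.docx")) returns a dictionary mapping each source file to True or False, which can then be filtered to list the failures.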

Validating Output Quality

After batch conversion, spot-check your output:

  1. Random sample — Open 5-10 converted files and compare to the originals
  2. Table count — Count tables in a document and verify they all converted
  3. Heading levels — Ensure heading hierarchy is preserved
  4. Broken links — Search for Markdown links and verify they resolve

A simple check script:

# List files with no headings (might indicate conversion failure)
for md in ./markdown-output/*.md; do
    if ! grep -q "^#" "$md"; then
        echo "No headings found: $md"
    fi
done
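
The heading and link checks from the list above can also be automated. The sketch below is a simplified checker, assuming links are written in the inline [text](target) form; check_markdown is a hypothetical helper name. It flags files with no headings, heading levels that skip a step (for example H1 straight to H3), and relative link or image targets that do not exist on disk. It does not understand fenced code blocks, so a "# comment" line inside code would be counted as a heading.

```python
import re
from pathlib import Path
from urllib.parse import unquote

HEADING = re.compile(r"^(#{1,6})\s")
LINK = re.compile(r"!?\[[^\]]*\]\(([^)\s]+)")
URL_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def check_markdown(md_path: Path) -> list[str]:
    """Return human-readable warnings for one converted file."""
    warnings = []
    text = md_path.read_text(encoding="utf-8")

    # Collect heading levels in document order (1 for "#", 2 for "##", ...).
    levels = [len(m.group(1)) for line in text.splitlines()
              if (m := HEADING.match(line))]
    if not levels:
        warnings.append("no headings found")
    for prev, cur in zip(levels, levels[1:]):
        if cur > prev + 1:
            warnings.append(f"heading level jumps from H{prev} to H{cur}")

    for target in LINK.findall(text):
        if URL_SCHEME.match(target) or target.startswith("#"):
            continue  # external URL or in-page anchor; not checked here
        # Drop any #fragment and decode %20-style escapes before checking.
        local = unquote(target.split("#")[0])
        if not (md_path.parent / local).exists():
            warnings.append(f"missing link target: {target}")
    return warnings
```

Looping over Path("./markdown-output").rglob("*.md") and printing any non-empty result gives a quick list of files worth opening by hand.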

FAQ

Can I batch convert .doc files (old Word format)? Not directly: Pandoc reads .docx but not the older binary .doc format, so save them as .docx first. In Word, a VBA macro can loop over the files and re-save each one as .docx. LibreOffice can also batch-convert via command line: soffice --headless --convert-to docx *.doc.

How long does batch conversion take? Pandoc converts a typical 10-page document in under a second, so 100 such files finish in roughly 1-2 minutes when processed one after another. Very large documents (100+ pages) take longer.

I’m getting “pandoc: command not found” in my script. Make sure Pandoc is installed and on your PATH. Run which pandoc (macOS/Linux) or Get-Command pandoc (PowerShell) to check.
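
In the Python script, the same check can run up front so the batch fails fast with a clear message instead of once per file. A minimal sketch, using the standard library's shutil.which; the helper names find_executable and require_pandoc are illustrative:

```python
import shutil
import subprocess
from typing import Optional


def find_executable(name: str = "pandoc") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if missing."""
    return shutil.which(name)


def require_pandoc() -> str:
    """Exit with a readable message if Pandoc is missing; else print its version."""
    path = find_executable("pandoc")
    if path is None:
        raise SystemExit("Pandoc not found on PATH. Install it, then re-run.")
    version = subprocess.run(
        [path, "--version"], capture_output=True, text=True
    ).stdout.splitlines()
    if version:
        print(f"Using {version[0]} at {path}")
    return path
```

Calling require_pandoc() at the top of main() stops the run before any files are touched.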

Can I run batch conversion on Windows without PowerShell? Yes — use a .bat file or install Git Bash to run the bash scripts.

Conclusion

Batch Word to Markdown conversion is straightforward with Pandoc and a few lines of scripting. The PowerShell and Bash scripts above handle most scenarios. For custom frontmatter, directory structure preservation, or image handling, the Python script gives you full control. For one-off conversions, WordToMD remains the fastest option.