Diff (and collaborate on) Microsoft Word documents using GitHub

TL;DR: Word Diff empowers you to be a Markdown person in a Microsoft Word world by automatically converting Word documents to Markdown each time you commit to GitHub, where they can be diffed, versioned, and collaborated on internally.

As a lawyer and a developer, I find it frustrating to see great collaboration tools like version control snubbed in favor of the lowest common collaborative denominator: emailing Microsoft Word documents with ambiguously versioned filenames.

At GitHub, we use Git and the pull request flow to collaborate on everything, not just software. But when working with government agencies and outside counsel, we’re often forced to fall back to “the old way” of doing things. You’d be hard-pressed to find a better way to troll a developer than to swap out things like distributed version control for in-document tracked changes emailed back and forth.

What did you change in the last round of redlines? How do I know you didn’t just turn off track changes and make me promise to sell you my soul? How can my coworkers propose a change to the document? What metadata am I inadvertently sending to you in this black box? Why is there all this formatting noise for such a simple change? And most importantly, what happens when the file inevitably becomes corrupt?

Geeks solved this problem decades ago. It’s called version control. We know that if so much as a single character is off, entire programs come crashing to a halt. That’s why we have evolved collaboration tools like Git to track every change, both proposed and realized, from both internal and external collaborators. The business world, for various reasons, hasn’t followed quite the same path. That’s where Word Diff comes in.

Let’s say you’re working on a Word document and have made three rounds of changes, committing the file to a Git repository after each round. Your change history might look something like this:

Changes to a .docx file

Normally, if you were to try to view a single commit to review what’s changed, due to Word’s black-box nature, the best you could get would likely be something like this:

binary diff

That’s because, despite its widespread adoption, a Microsoft Word document is, in reality, a proprietary and purpose-built legacy format, and one that’s especially hard to use outside Microsoft Word. And that’s the exact problem Word Diff aims to solve.
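To make the “binary diff” result a little less mysterious, here is a minimal sketch of the kind of check Git appears to use when deciding whether a file can be diffed as text: as far as I understand it, Git looks for a NUL byte near the start of the file. The 8000-byte window and the function name `looks_binary` are my approximations, not Git’s actual code, and the sample bytes are illustrative rather than a real document.

```python
# Rough approximation of Git's binary-detection heuristic (an assumption,
# not Git's real implementation): treat content as binary if a NUL byte
# appears within the first few thousand bytes.
FIRST_FEW_BYTES = 8000


def looks_binary(data: bytes) -> bool:
    """Return True if the leading bytes contain a NUL byte."""
    return b"\x00" in data[:FIRST_FEW_BYTES]


# A .docx file is a ZIP container, so it opens with a ZIP header and
# hits NUL bytes almost immediately. These bytes are illustrative only.
docx_like = b"PK\x03\x04\x14\x00\x06\x00"

# The Markdown equivalent is ordinary UTF-8 text with no NUL bytes.
markdown = "# Important file\n\nSome *important* text.\n".encode("utf-8")

print(looks_binary(docx_like))
print(looks_binary(markdown))
```

Once the content lives in a plain-text file, the same check passes and Git can show a line-by-line diff instead of a placeholder.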

Word Diff empowers you to be a Markdown person in a Microsoft Word world.

As you work, Word Diff sits on a server (in my case Heroku), waiting for you to push your changes. When you do, it springs into action, automatically converting the Word document to Markdown after each commit:

Changes to a .md file

You’ll notice that for each change I made to important-file.docx, Word Diff made that same change to important-file.md, crediting me as the author, and preserving my original commit message, as it transparently committed an updated Markdown file to the repository after each change to the Word document. That way Git functions you’re used to — like blame and a file’s commit history — work just as you’d expect.
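The bookkeeping described above can be sketched roughly as follows. This is not Word Diff’s actual code: `markdown_path` and `mirror_plan` are hypothetical helpers, and the sketch assumes a GitHub push webhook payload in which each commit lists its `added` and `modified` paths alongside its `message` and `author`. The conversion step itself (handled in Word Diff by the Word to Markdown gem) is left out.

```python
from pathlib import PurePosixPath


def markdown_path(docx_path: str) -> str:
    """Map important-file.docx to important-file.md in the same directory."""
    return str(PurePosixPath(docx_path).with_suffix(".md"))


def mirror_plan(push_payload: dict) -> list[dict]:
    """List the Markdown commits that would mirror each Word commit.

    Every entry keeps the original commit message and author, so that
    blame and history on the .md file line up with the .docx file.
    """
    plan = []
    for commit in push_payload.get("commits", []):
        touched = commit.get("added", []) + commit.get("modified", [])
        for path in touched:
            if path.lower().endswith(".docx"):
                plan.append({
                    "source": path,
                    "target": markdown_path(path),
                    "message": commit["message"],
                    "author": commit["author"],
                })
    return plan


# Hypothetical payload, trimmed to the fields this sketch reads.
payload = {
    "commits": [
        {
            "message": "Second round of edits",
            "author": {"name": "Example Author", "email": "author@example.com"},
            "added": [],
            "modified": ["important-file.docx", "README.md"],
        }
    ]
}

for step in mirror_plan(payload):
    print(step["source"], "->", step["target"], "|", step["message"])
```

Carrying the author and message across unchanged is what keeps the Markdown file’s history readable on its own, rather than showing a bot as the author of every change.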

If you were to click the first commit, you’d see exactly what was changed, without the need to download a large, proprietary file; leave the safety and comfort of your browser; or fire up slow desktop software:

Formatting Diff

But last February, GitHub introduced rendered prose diffs to better visualize changes to human-readable text. If you click on the second commit, you can see exactly what content was changed in its rendered form:

Content Diff

Heck, you could even do a split diff if you really wanted to dig into things:

Split diff
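For a feel of what a line-level diff of the converted Markdown looks like outside the browser, here is a small sketch using Python’s standard `difflib` module. The two document versions are made up for illustration, and `difflib` uses its own matching algorithm, so the output only approximates what Git or GitHub would show.

```python
import difflib

# Two hypothetical revisions of the converted Markdown file.
before = "# Agreement\n\nThe term is **one** year.\n".splitlines()
after = "# Agreement\n\nThe term is **two** years.\n".splitlines()

# unified_diff yields header lines, hunk markers, and -/+ prefixed lines.
diff = list(
    difflib.unified_diff(
        before,
        after,
        fromfile="a/important-file.md",
        tofile="b/important-file.md",
        lineterm="",
    )
)

print("\n".join(diff))
```

The changed sentence shows up as one removed line and one added line, with none of the formatting noise a Word file would carry along.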

At each iteration, I simply committed the Word document (either via the command line, or via the visual interface of GitHub for Mac/Windows), and Word Diff silently took care of the rest. I never touched the Markdown file (or bothered to convert things to Markdown myself).

When would you use this? Let’s say you’re collaborating on a document with someone. Normally, you’d email Word documents with tracked changes back and forth (or use real-time collaborative editing tools that don’t really capture process). With Word Diff, you can use Git’s native diff functionality, backed by cryptographic hashes that help ensure the authenticity and integrity of a document, to quickly verify what’s changed in a given iteration, or compare different versions of the document over time, all with a single click.
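The integrity part of that claim comes from the way Git names every version of a file by a hash of its contents. A minimal sketch, assuming a repository that uses Git’s default SHA-1 object format (`git_blob_id` is a name I made up for illustration):

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a file's contents.

    Git hashes a short header ("blob", the byte length, a NUL byte)
    followed by the raw content.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


original = b"The term is **one** year.\n"
tampered = b"The term is **two** years.\n"

# Any change to the content, however small, produces a different ID.
print(git_blob_id(original))
print(git_blob_id(tampered))
```

Because each commit records the IDs of the files it contains, a quietly altered document cannot keep the same history. Establishing who made a change is a separate matter that rests on the commit’s author information or on signed commits.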

More importantly, you can collaborate using the simple tools you love, like Markdown and Git, and the person you’re collaborating with will be none the wiser (use something like Pandoc or LibreOffice to convert things back to Word, if you must).

It’s still a bit rough around the edges, but if you’re interested in giving it a try, you can follow these instructions to set up your own instance of Word Diff, or take a peek under the hood by looking at the Word to Markdown Ruby Gem. This is very much just the start. I’d love your feedback (and help making it better).

Edit: Looking to redline a document with the GitHub uninitiated? Check out Redliner.

Originally published February 6, 2015

benbalter

Ben Balter is the Director of Hubber Engagement within the Office of the COO at GitHub, the world’s largest software development platform, ensuring all Hubbers can do their best (remote) work. Previously, he served as the Director of Technical Business Operations, and as Chief of Staff for Security, he managed the office of the Chief Security Officer, improving overall business effectiveness of the Security organization through portfolio management, strategy, planning, culture, and values. As a Staff Technical Program manager for Enterprise and Compliance, Ben managed GitHub’s on-premises and SaaS enterprise offerings, and as the Senior Product Manager overseeing the platform’s Trust and Safety efforts, Ben shipped more than 500 features in support of community management, privacy, compliance, content moderation, product security, platform health, and open source workflows to ensure the GitHub community and platform remained safe, secure, and welcoming for all software developers. Before joining GitHub’s Product team, Ben served as GitHub’s Government Evangelist, leading the efforts to encourage more than 2,000 government organizations across 75 countries to adopt open source philosophies for code, data, and policy development.