The Hidden Complexity of Getting Structured Data Out of Word

If you’ve ever tried to “just extract the text” from a Word document and keep all the formatting, you’ve probably discovered it’s anything but “just.” Under the hood, Word is closer to a layout engine plus a tiny CMS than a simple text editor, and that makes faithful extraction surprisingly hard.

Published on February 19, 2026

Word files are little ecosystems, not single documents

A modern .docx file is actually a ZIP archive containing a set of XML files. There’s one for the main document content (typically word/document.xml), another for styles (word/styles.xml), another for numbering definitions (word/numbering.xml), and others for images, footnotes, relationships, and so on. What you see in Word is assembled live from this collection rather than stored as one neat linear stream of text.

When you open a document, Word pulls in a paragraph from one place, styling information from another, numbering definitions from yet another, and then renders them together on the screen. To recreate that outside Word, a library has to understand and merge all of those pieces correctly. Most open source tools, understandably, only look at a subset of this structure, which is where things start to get lost.
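To make the "ZIP of XML files" point concrete, here is a minimal sketch using only Python's standard library. It builds a tiny stand-in archive in memory and lists its parts. The part names mirror the usual .docx layout, but the contents are placeholders and this archive is not a valid Word document; with a real file you would open its path with `zipfile.ZipFile` instead.

```python
import io
import zipfile

# Hypothetical, heavily trimmed stand-in for a real .docx: the part names
# follow the usual layout, the bodies are placeholders for illustration.
PARTS = {
    "[Content_Types].xml": "<Types/>",
    "word/document.xml": "<document/>",      # main content
    "word/styles.xml": "<styles/>",          # style definitions
    "word/numbering.xml": "<numbering/>",    # list/numbering definitions
    "word/footnotes.xml": "<footnotes/>",    # footnote text
    "word/_rels/document.xml.rels": "<Relationships/>",
}


def build_sample_docx() -> bytes:
    """Write the placeholder parts into an in-memory ZIP archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in PARTS.items():
            zf.writestr(name, body)
    return buf.getvalue()


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the sorted names of every part stored in the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return sorted(zf.namelist())


parts = list_parts(build_sample_docx())
for name in parts:
    print(name)
```

An extractor that reads only word/document.xml never sees the styles or numbering parts, which is one way the subset problem described above shows up.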

Why numbered lists are such a troublemaker

Numbered lists are a great example of how different Word’s model is from what most people expect. You might assume that the “1.” or “2.” you see is just part of the text of that paragraph. In Word, it usually isn’t.

Instead, each list paragraph carries metadata that says “I belong to numbering scheme X at level Y.” The actual number (1, 2, 3… or I, II, III…, or 1.1, 1.2, etc.) is computed when the document is rendered, according to rules defined in the numbering part of the file. Those rules control:

  • The numbering format (decimal, roman numerals, letters, bullets).
  • The level in a multi‑level outline.
  • Whether numbering continues from a previous list or restarts.
  • Indentation and alignment details.
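As a rough illustration, the sketch below parses a hand-written fragment shaped like the body of word/document.xml. The `w:numPr`, `w:numId`, and `w:ilvl` elements are the WordprocessingML markers for "numbering scheme X at level Y"; the sample text and IDs are made up. Note that real documents can also attach numbering through the paragraph style rather than directly on the paragraph, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"  # namespaced attribute name for w:val

# Hand-written sample (an assumption for illustration, not from a real file).
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="3"/></w:numPr></w:pPr>
    <w:r><w:t>Introduction</w:t></w:r>
  </w:p>
  <w:p>
    <w:pPr><w:numPr><w:ilvl w:val="1"/><w:numId w:val="3"/></w:numPr></w:pPr>
    <w:r><w:t>Background</w:t></w:r>
  </w:p>
  <w:p><w:r><w:t>Plain paragraph.</w:t></w:r></w:p>
</w:body>
"""


def describe_paragraphs(xml_text):
    """Return (text, num_id, level) per paragraph; None when not a list item."""
    body = ET.fromstring(xml_text.strip())
    rows = []
    for p in body.findall("w:p", NS):
        # The visible text is the concatenation of all w:t runs.
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            rows.append((text, None, None))
            continue
        num_id = int(num_pr.find("w:numId", NS).get(VAL))
        level = int(num_pr.find("w:ilvl", NS).get(VAL))
        rows.append((text, num_id, level))
    return rows


for row in describe_paragraphs(SAMPLE):
    print(row)
```

The extracted text is just "Introduction" and "Background". The numbers a reader would see are nowhere in the paragraph; only the scheme ID and level are.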

When a parser just reads the paragraph text, there is no literal “1.” to grab. Unless it also understands and applies the numbering definitions, it will see only a paragraph that happens to be flagged as “some list item.” Many open source packages therefore:

  • Treat numbered and bulleted lists the same.
  • Output list items without their actual numbers.
  • Flatten multi‑level outlines into a simple list or plain paragraphs.

That’s why a carefully structured legal document or technical spec with “1., 1.1, 1.1.1” often turns into a blob of text with dashes or nothing at all where the numbers used to be.
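To show what "computing the number at render time" involves, here is a deliberately simplified sketch. It assumes every level uses decimal numbering joined with dots, that counting starts at 1, and that deeper levels restart whenever a shallower level advances. A faithful implementation would instead read the formats, start values, and restart rules from the numbering part; the function name and input shape are hypothetical.

```python
def number_labels(items):
    """Compute labels like "1.2" for (num_id, level) pairs in document order.

    Simplifying assumptions: decimal format only, dot-separated levels,
    every counter starts at 1, and each num_id is counted independently.
    """
    counters = {}  # num_id -> list of per-level counters
    labels = []
    for num_id, level in items:
        c = counters.setdefault(num_id, [])
        # Make sure a counter exists for this level.
        while len(c) <= level:
            c.append(0)
        c[level] += 1
        # Advancing this level restarts everything nested below it.
        del c[level + 1:]
        # A level that was skipped over is treated as 1 (assumption).
        for i in range(level):
            if c[i] == 0:
                c[i] = 1
        labels.append(".".join(str(n) for n in c[: level + 1]))
    return labels


outline = [(1, 0), (1, 1), (1, 2), (1, 1), (1, 0), (1, 1)]
print(number_labels(outline))
# prints ['1', '1.1', '1.1.1', '1.2', '2', '2.1']
```

Even this toy version needs state that spans paragraphs, which is why a parser that looks at one paragraph at a time cannot recover the numbers.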

Styles and inheritance hide important meaning

Another invisible layer is Word’s style system. A heading or list item rarely has all of its formatting specified directly on the text. Instead, it might:

  • Use a paragraph style (e.g., “Heading 2” or “List Paragraph”).
  • Inherit from another style via “based on” relationships.
  • Combine that with local overrides (bold here, indentation tweak there).

The style defines not just how the text looks (font, size, color), but often what it means (this is a heading, this is a list item, this is a quote). Many extractors either ignore styles completely or reduce them to something very minimal. They might give you bold and italics, but lose the fact that a particular line was a “Heading 3” or a numbered heading tied to an outline level.

When that semantic information disappears, you lose the structure that made the document navigable and machine‑understandable in the first place. For tasks like building a table of contents, generating anchors, or mapping sections into another system, that’s a big problem.
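The “based on” chain can be sketched as follows. The XML is a hand-written fragment shaped like word/styles.xml (`w:style`, `w:basedOn`, and `w:outlineLvl` are real WordprocessingML element names; the specific values are invented). The merge rule used here, where a child style simply overrides its parent property by property, is a simplification: Word also layers in document defaults and direct formatting, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"

# Hand-written sample styles (assumed values, for illustration only).
STYLES = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:rPr><w:sz w:val="22"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:basedOn w:val="Normal"/>
    <w:pPr><w:outlineLvl w:val="0"/></w:pPr>
    <w:rPr><w:b/><w:sz w:val="32"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Heading2">
    <w:basedOn w:val="Heading1"/>
    <w:pPr><w:outlineLvl w:val="1"/></w:pPr>
    <w:rPr><w:sz w:val="26"/></w:rPr>
  </w:style>
</w:styles>
"""


def load_styles(xml_text):
    """Map styleId -> {"based_on": parent id or None, "props": own properties}."""
    root = ET.fromstring(xml_text.strip())
    styles = {}
    for s in root.findall("w:style", NS):
        props = {}
        for group in ("w:pPr", "w:rPr"):
            for el in s.findall(f"{group}/*", NS):
                name = el.tag.split("}", 1)[1]  # strip the namespace
                props[name] = el.get(VAL, True)  # flag elements have no w:val
        parent = s.find("w:basedOn", NS)
        styles[s.get(f"{{{W}}}styleId")] = {
            "based_on": None if parent is None else parent.get(VAL),
            "props": props,
        }
    return styles


def resolve(styles, style_id):
    """Merge properties from the root ancestor down to the given style."""
    chain, seen = [], set()
    while style_id in styles and style_id not in seen:  # guard against cycles
        seen.add(style_id)
        chain.append(styles[style_id])
        style_id = chain[-1]["based_on"]
    effective = {}
    for style in reversed(chain):  # parents first, so children override
        effective.update(style["props"])
    return effective


print(resolve(load_styles(STYLES), "Heading2"))
```

In this sample, "Heading2" inherits bold from "Heading1" while overriding the size and outline level. An extractor that reads only the style's own properties would miss the inherited bold, and one that ignores styles entirely would miss the outline level that marks the paragraph as a second-level heading.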

Layout features don’t map neatly to plain formats

Word supports lots of layout constructs that just don’t have a clean equivalent in plain text or even in simple HTML/Markdown:

  • Multi‑column layouts.
  • Floating text boxes and shapes.
  • Footnotes and endnotes.
  • Cross‑references and fields.
  • Complex table structures.

When you try to extract content, you have to answer awkward questions like:

  • Do we inline footnote text where it appears, move it all to the bottom, or drop it?
  • How do we represent a sidebar text box that sits beside the main text on the page?
  • What happens to a two‑column layout when your target format is linear?
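The footnote question can be made concrete with a small sketch. Everything here is hypothetical: the `{fn:N}` marker syntax, the `FootnotePolicy` names, and the `linearize` function are illustrations of the three options listed above, not the behavior of any particular library.

```python
import re
from enum import Enum


class FootnotePolicy(Enum):
    INLINE = "inline"  # put the note text where the reference appears
    END = "end"        # keep a marker, collect the notes at the bottom
    DROP = "drop"      # discard notes and markers entirely


# Hypothetical marker syntax standing in for a footnote reference.
MARKER = re.compile(r"\{fn:(\d+)\}")


def linearize(paragraphs, footnotes, policy):
    """Flatten paragraphs plus a footnote table into a list of text lines."""
    used = []  # footnote ids in order of first reference (END policy only)

    def replace(match):
        note_id = int(match.group(1))
        if policy is FootnotePolicy.INLINE:
            return f" ({footnotes[note_id]})"
        if policy is FootnotePolicy.END:
            if note_id not in used:
                used.append(note_id)
            return f"[{note_id}]"
        return ""  # DROP

    lines = [MARKER.sub(replace, p) for p in paragraphs]
    if policy is FootnotePolicy.END:
        lines += [f"[{note_id}] {footnotes[note_id]}" for note_id in used]
    return lines


paragraphs = ["Rates apply{fn:1}.", "See terms."]
footnotes = {1: "Excluding tax"}
for policy in FootnotePolicy:
    print(policy.name, linearize(paragraphs, footnotes, policy))
```

Each policy yields a different but defensible output from the same input, and a downstream consumer cannot tell from the output alone which choice was made.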

Different libraries make different choices, but almost all of them end up sacrificing some combination of structure, semantics, or visual fidelity. Numbered headings are particularly fragile: many documents rely on multi‑level numbering for “1 Introduction, 1.1 Background, 1.1.1 Scope” and those numbers often come entirely from Word’s numbering system, not from text the author typed manually.

Open source tools make pragmatic trade‑offs

Re‑implementing all of Word’s logic outside of Word is a huge job. You’d have to fully understand the Office Open XML spec, then mimic how Word interprets it (including historical quirks and edge cases). That’s not realistic for most open source projects, which typically aim for:

  • Getting usable text out quickly.
  • Handling common documents “well enough.”
  • Keeping dependencies and code complexity low.

As a result, many libraries:

  • Focus on extracting plain text and maybe some inline formatting.
  • Only partially support lists and numbering, or treat everything as bullets.
  • Don’t fully resolve style inheritance or multi‑level outlines.
  • Ignore some of the more advanced layout and field features.

For simple documents, this works fine. For anything that leans heavily on Word’s numbering, styles, and layout features, you see the cracks: missing numbers, flattened hierarchy, lost headings, and content that no longer matches the author’s intent.

So why is Word extraction hard?

In the end, it’s hard because you’re trying to flatten a rich, layered, and somewhat idiosyncratic document model into something much simpler. Word was designed to display and print documents, not to be an easy, lossless source format. Numbered lists, headings, and styles are all defined indirectly and assembled at render time, while most open source extraction tools expect formatting to be explicit and local.

Until a library essentially behaves like a mini‑Word that understands numbering definitions, style inheritance, and layout rules, some pieces, like auto‑generated numbering, will keep falling through the cracks.
