
Getting text from a Word doc with no subscription?

@Cugini998

Posted in: #MicrosoftWord

So I don't have a legit (or non-legit) copy of MS Office. I really don't want to buy this sorry software, but my client keeps sending me Word 2016/365 docs with markups and crazy formatting and I'm having trouble lifting the text (don't need the formatting) for copying into InDesign. They are very resistant to sending me a PDF for some reason.

I've tried Word Reader, and Word Online seems to have a suspicious "sync" option for uploading the Word docs there, so I want to avoid this option too. Any other ways to get the text out of the Word doc? Thanks!


4 Comments


@Turnbaugh909

Getting the plain text into InDesign

"For copying"? There is no need for such drastic measures; just use the regular Place... command in InDesign. Make sure to tick the Show Import Options checkbox, as this will show a dialog after you have selected a file.

In this dialog, select "Remove Styles and Formatting from Text and Tables". Press OK to import the file with these settings. You will see the imported text is stripped of all "markups and crazy formatting".

However, be warned that some formatting adds text! For instance, bulleted and numbered lists count as 'formatting', and "clearing" the formatting from these paragraphs makes the bullets and numbers disappear. A way around this is to open the file with a Word-compatible editor that can convert such items to plain text (Microsoft Word itself can do that; I don't know of any others that can).



A good alternative to the above is to import the file with formatting, and then clear all overrides. See the online Help for that. The advantage over importing without any formatting is that you can use InDesign's built-in Convert Bullets and Numbering to Text function before removing the formatting that added them.



Lifting text straight out of the file

A .docx file is a compound document format with all of its constituent parts zipped into one file. You can extract the individual files from it with any PKZip-compatible software (for some tools you may need to change the file extension to .zip first). After that, you get a folder full of data. The file that contains the plain text is usually word/document.xml. Pay attention to the other files, though; in my test there is one of more than 2 KB called word/footnotes.xml.
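If you would rather not rename and unzip by hand, the same inspection can be sketched with Python's standard zipfile module. This is only a rough illustration: the function names are made up for this post, make_demo_docx builds a tiny stand-in archive (not a valid Word file) so the sketch runs without a client document, and UTF-8 is assumed for the XML parts.

```python
import io
import zipfile


def list_docx_parts(docx):
    """Return (name, size_in_bytes) for every part inside a .docx archive.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


def read_docx_part(docx, part="word/document.xml"):
    """Return the raw XML of one part as a string (UTF-8 is an assumption)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read(part).decode("utf-8")


def make_demo_docx():
    """Build a tiny stand-in archive in memory. It is NOT a valid Word file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<doc>body text</doc>")
        archive.writestr("word/footnotes.xml", "<notes>a footnote</notes>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # Replace make_demo_docx() with the path to a real .docx file.
    for name, size in list_docx_parts(make_demo_docx()):
        print(f"{size:8d}  {name}")
```

Listing the sizes first makes it easy to spot parts such as word/footnotes.xml that carry text outside the main document.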

These are XML files, and getting just the plain text out of them is still not an easy operation. I wrote up a quick and dirty XSLT stylesheet to extract the plain text and found a few caveats right away.

First, paragraphs are delimited by XML tags, not by returns of any kind, so discarding all XML tags indiscriminately is not an option.

As with returns, tabs are not physically present in the file; they have a tag of their own (w:tab).

Also, not all plain text should appear in your output. Some tags store their data as element text rather than as attributes (wp:posOffset, for one), and, conversely, some tags (such as w:vanish) seem to indicate that their associated paragraph should not be included in the visible text.

Finally, there is the same problem as when you discard all formatting: bullets and numbered lists are not physically present in the file but only indicated through tags. I have not researched this any further; you'd need quite a complex stylesheet to preserve these correctly.

Keeping the above in mind, this is the XSLT stylesheet I wrote up.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
>

  <xsl:output method="text" indent="no" />

  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>

  <!-- do not include paragraphs marked 'vanish' -->
  <!-- every other paragraph is followed by a blank line -->
  <xsl:template match="w:p">
    <xsl:if test="not(.//w:vanish)">
      <xsl:apply-templates />
      <xsl:text>&#10;&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- a tab character inside a run; a w:tab under w:pPr/w:tabs -->
  <!-- is a tab stop definition and must not produce any output -->
  <xsl:template match="w:r/w:tab">
    <xsl:text>&#9;</xsl:text>
  </xsl:template>

  <!-- this tag contains data as plain text, not as an attribute -->
  <!-- Tsk tsk. (There could be more like these.) -->
  <xsl:template match="wp:posOffset">
  </xsl:template>

</xsl:stylesheet>
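
For readers without an XSLT processor to hand, roughly the same rules can be sketched with Python's standard library. This is not a translation of the stylesheet above. Instead of discarding unwanted tags, it collects only w:t text and w:tab elements found directly inside runs (w:r), skips any paragraph containing w:vanish, and ignores footnotes, list numbering and text boxes. The function names and SAMPLE are made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, the same URI the stylesheet binds to "w".
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def tag(name):
    """Return the namespaced tag name in ElementTree's {uri}name form."""
    return f"{{{W}}}{name}"


def paragraph_text(paragraph):
    """Join the text and tab characters of every run in one paragraph."""
    pieces = []
    for run in paragraph.iter(tag("r")):
        for node in run:  # direct children of the run only
            if node.tag == tag("t"):
                pieces.append(node.text or "")
            elif node.tag == tag("tab"):
                pieces.append("\t")
    return "".join(pieces)


def document_text(xml_source):
    """Return the visible text of a word/document.xml string, one line per paragraph."""
    root = ET.fromstring(xml_source)
    lines = []
    for paragraph in root.iter(tag("p")):
        # Skip paragraphs with any run marked hidden, as the stylesheet does.
        if paragraph.find(f".//{tag('vanish')}") is not None:
            continue
        lines.append(paragraph_text(paragraph))
    return "\n".join(lines)


# A hand-written fragment for illustration, not output from Word.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:pPr><w:tabs><w:tab w:val="left" w:pos="720"/></w:tabs></w:pPr>'
    "<w:r><w:t>Name</w:t><w:tab/><w:t>Value</w:t></w:r></w:p>"
    "<w:p><w:r><w:rPr><w:vanish/></w:rPr><w:t>hidden</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

if __name__ == "__main__":
    print(document_text(SAMPLE))
```

Collecting only known text-bearing tags means stray element data such as wp:posOffset never reaches the output, at the cost of silently dropping any text stored in tags this sketch does not know about.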


@Gloria351

You can also use the free Microsoft Office web applications (Word, PowerPoint, etc.) through OneDrive if you have a Microsoft account on any of their platforms. Otherwise, I believe you can create an account for free.


@Miguel516

I had a client sending me .docx files that I could not open. I asked them to send me .doc format instead, but they didn't seem to know how. So I would email the files to myself at my Gmail account and open them with Google Docs! Simple!
I think you can even save the files as PDF in Google Docs.


@Si6392903

Have you tried LibreOffice or OpenOffice? Both are free.

You could also use a trial period of Word.

And what if you installed Word Viewer on a computer without internet access, or with the internet turned off for a moment? You could print your document to a PDF file and then turn the internet back on.
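
If LibreOffice is installed, its command-line converter can also produce the plain text without opening the document at all. Below is a rough Python sketch around that converter. It assumes the binary is on your PATH as soffice or libreoffice, and the function names are made up for illustration. The --headless, --convert-to and --outdir options and the txt:Text filter are LibreOffice's own; the encoding of the resulting file depends on your setup.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="soffice"):
    """Return the LibreOffice command line that converts a document to plain text."""
    return [
        binary,
        "--headless",                 # run without opening any window
        "--convert-to", "txt:Text",   # target extension and export filter
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_text(docx_path, out_dir="."):
    """Run the conversion and return the expected path of the .txt file."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice command-line binary not found on PATH")
    subprocess.run(build_convert_command(docx_path, out_dir, binary), check=True)
    return Path(out_dir) / (Path(docx_path).stem + ".txt")


if __name__ == "__main__":
    # "brief.docx" is a placeholder file name.
    print(" ".join(build_convert_command("brief.docx", "out")))
```

The resulting .txt file can then be placed in InDesign like any other text file.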
