
Is there a good way to include plain text versions of embedded documents for search engines to find?

@Jamie184

Posted in: #Embed #SearchEngines #Seo

I'm developing a site whose entire function is defined by the documents it provides. The documents are provided to the user both through embedding in the page (using the Google Drive Viewer) and through download links.

My question is, if I have the plaintext for each of these documents (ranging from PDFs to MS Office files), is there a good way to make this content visible to the search engine spiders?

I thought of including the entire document's content in the alt attribute, but given that many of these documents are over 100 pages, I assume that would break something somewhere. I also thought about setting the download links' href to the plain text and then using JS to change the link destination on load, but I assume search engines would detect that and not look fondly on it.

Is there a good, standard practice for handling something like this, or am I basically out of luck for getting this content noticed?





3 Comments


 

@Megan663

What you should do:


Ask users whether their docs may be shared publicly on the web
Create a sitemap.xml listing the users
Create a sitemap_username1.xml for each user with their data and the list of their docs


I think your URL structure could look like this:

/ --root

/username/ -- user data

/username/docnameX -- doc

/username/docnameX?plaintext -- doc in plain text

Or, perhaps even better, use a hash fragment:

/username/docnameX#plaintext -- doc in plain text

(Note that the fragment is not sent to the server, so it has to be handled client-side; see stackoverflow.com/questions/2181186/how-to-access-url-hash-fragment-from-a-django-request-object and stackoverflow.com/questions/3847870/php-to-get-value-of-hashtag-from-url.) A sketch of serving the ?plaintext variant is below.
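As a minimal sketch of what serving the ?plaintext variant could look like (this assumes a Flask app and a hypothetical load_document() helper; neither is from the original post):

from flask import Flask, Response, render_template_string, request

app = Flask(__name__)

def load_document(username, docname):
    # Hypothetical helper: returns (embed_url, plain_text) for the stored document.
    raise NotImplementedError

@app.route("/<username>/<docname>")
def show_document(username, docname):
    embed_url, plain_text = load_document(username, docname)
    if "plaintext" in request.args:
        # Serve the extracted text directly so crawlers can index it.
        return Response(plain_text, mimetype="text/plain")
    # Otherwise render the normal page with the embedded viewer.
    return render_template_string(
        '<iframe src="{{ url }}" width="100%" height="600"></iframe>',
        url=embed_url,
    )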

So you create /sitemap.xml with contents like this:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://sitename.tld/sitemap_users.xml</loc>
<lastmod>2007-04-18T12:05:20-04:00</lastmod>
</sitemap>
<sitemap>
<loc>http://sitename.tld/sitemap_else.xml.gz</loc>
<lastmod>2006-07-28T08:42:17-04:00</lastmod>
</sitemap>
</sitemapindex>


I took this example from edition.cnn.com/sitemap_index.xml
And yes, you can gzip the XML files too (see the sketch below).
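A minimal sketch of gzipping a sitemap file (the file names here are just placeholders):

import gzip, shutil

# Compress sitemap_else.xml into sitemap_else.xml.gz.
with open("sitemap_else.xml", "rb") as src, gzip.open("sitemap_else.xml.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)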

/sitemap_users.xml

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://sitename.tld/sitemap_username1.xml</loc>
<lastmod>2007-04-18T12:05:20-04:00</lastmod>
</sitemap>
<sitemap>
<loc>http://sitename.tld/sitemap_username2.xml</loc>
<lastmod>2006-07-28T08:42:17-04:00</lastmod>
</sitemap>

</sitemapindex>


/sitemap_username1.xml

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
<loc>http://sitename.tld/username1/</loc>
<lastmod>2012-07-27</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://sitename.tld/username1/docname1</loc>
<lastmod>2011-09-18</lastmod>
<changefreq>daily</changefreq>
<priority>0.4</priority>
</url>
</urlset>
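A rough sketch of generating such a per-user sitemap programmatically (the data shape and function name are my own, not from the post; it only uses the standard library):

from xml.sax.saxutils import escape

def build_user_sitemap(base_url, docs):
    # docs: iterable of (path, lastmod) pairs, e.g. ("/username1/docname1", "2011-09-18").
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for path, lastmod in docs:
        lines.append("<url>")
        lines.append("<loc>%s</loc>" % escape(base_url + path))
        lines.append("<lastmod>%s</lastmod>" % lastmod)
        lines.append("</url>")
    lines.append("</urlset>")
    return "\n".join(lines)

print(build_user_sitemap("http://sitename.tld",
                         [("/username1/", "2012-07-27"),
                          ("/username1/docname1", "2011-09-18")]))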


Do not insert links to your plain-text copies of the documents in the sitemaps.

You may have problems with duplicate content; I haven't found definitive rules for what to do in this situation.

Google will find your plain-text copies anyway, so just wait and then check the Google cache with queries like site:sitename.tld and cache:sitename.tld.

If you find a lot of plain-text copies of your documents ranking in high positions, you could try detecting search-engine bots (stackoverflow.com/questions/677419/how-to-detect-search-engine-bots-with-php) or checking the referrer (stackoverflow.com/questions/10613025/how-can-i-use-serverhttp-referer-to-find-that-user-came-from-google).
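As a rough illustration of the user-agent approach those links describe (the signature list below is illustrative, not exhaustive):

BOT_SIGNATURES = ("googlebot", "bingbot", "yandex", "duckduckbot")

def is_search_bot(user_agent):
    # Crude check: does the User-Agent header contain a known crawler signature?
    ua = (user_agent or "").lower()
    return any(sig in ua for sig in BOT_SIGNATURES)

print(is_search_bot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))  # True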
For large sitemaps, see stackoverflow.com/a/4241045/1346222 and dynamical.biz/blog/seo-technical/sitemap-strategy-large-sites-17.html:

You can provide multiple Sitemap files, but each Sitemap file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760 bytes). If you would like, you may compress your Sitemap files using gzip to reduce your bandwidth requirement; however the sitemap file once uncompressed must be no larger than 10MB. If you want to list more than 50,000 URLs, you must create multiple Sitemap files.


Prefer not to spoof search engines with lastmod, changefreq, and priority; these values are only hints, and search engines will recalculate them independently anyway.


I thought of including the entire doc file content in the alt attribute


A better idea is to put 5-10 words from the first paragraph of the doc in the title; if you put in all of the text, search engines may mark it as spam.
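A small sketch of that idea (the function name is hypothetical):

def title_from_text(plain_text, max_words=10):
    # Use the first few words of the document text as the page title.
    return " ".join(plain_text.split()[:max_words])

print(title_from_text("Quarterly report on water quality in the lower basin, prepared for the county."))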

There is a good post about sitemaps from Google: googlewebmastercentral.blogspot.com/2008/01/sitemaps-faqs.html



 

@Cooney921

If you search filetype:doc, filetype:docx, and so on in Google, you'll see that the most common Office file formats, as well as PDFs, are routinely indexed.

They perhaps won't perform as well as a properly optimised web page, but for the kind of searches that people who want that sort of thing do (e.g., people looking for academic papers and so on), I can't see there being an issue there.

Incidentally, definitely don't stuff alt attributes with a whole document's worth of text: I'd be concerned about that looking like spam.

+1 for John's Sitemap suggestion, too. That's always good practice.



 

@Sarah324

Use an XML Sitemap. It allows you to tell search engines about every file you wish for them to index.


