
Serving PDF file via download script and via direct link: duplicated content?

@Miguel251

Posted in: #DuplicateContent #Google #Seo

I have a website which hosts a PDF document (a math paper). The main page of the website provides a link to the document, which is
example.com/download.php?file=Document.pdf


The purpose of the download.php script is to log the IP addresses that download the document.

Now, the document can also be viewed/downloaded by following the link
example.com/Document.pdf

Does this setup have any disadvantages from the SEO perspective (duplicate content)? And if yes, how can I make it better?


@Jamie184

Yes, this is duplicate content. The same content is accessible from two different URLs and there is no canonicalisation.

Basically, this means that the search engines will pick one or the other to return in the SERPs. Ranking is essentially split between the two URLs if both URLs are used for linking.


You need to decide which is the canonical/preferred URL and link only to that one URL.

For simplicity, we'll consider just the two URLs you've listed. The preferred URL would seem to be the one that goes via your download script (ie. download.php), otherwise you aren't going to be tracking the IPs of users downloading the file.

To resolve any already indexed URLs, you can externally redirect the direct link to your script. Assuming Apache, then you can do something like the following in your root .htaccess file:

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^([^/]+\.pdf)$ /download.php?file=$1 [R=302,L]


This will redirect a request for /Document.pdf (only if it exists as a physical file on the file system) to /download.php?file=Document.pdf.

$1 is a backreference to the first captured group in the RewriteRule pattern (ie. ([^/]+\.pdf)).

Change the 302 (temporary) redirect to a 301 (permanent) when you are sure it's working OK. 301s are cached by the browser so can make testing problematic.



A more "user friendly" URL (UPDATED)

You could take it one step further and create a more "user friendly" URL like /download/Document.pdf. This would then become the canonical URL - the URL that you link to.

In this case, since you have a file whose basename is also "download" (ie. download.php vs /download), you need to make sure that MultiViews is disabled. Otherwise mod_negotiation is likely to make an internal subrequest for download.php (depending on the request) before we've rewritten the URL. So, at the top of .htaccess:

Options -MultiViews


Any direct requests for /Document.pdf or /download.php?file=Document.pdf should be externally redirected to the canonical URL. For example:

RewriteCond %{REQUEST_FILENAME} -f
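RewriteRule ^([^/]+\.pdf)$ /download/$1 [R=301,L]

RewriteCond %{THE_REQUEST} ^GET\ /download\.php\?
RewriteCond %{QUERY_STRING} ^file=(.+\.pdf)$
RewriteRule ^download\.php$ /download/%1 [QSD,R=301,L]


%1 (as opposed to $1, mentioned above) is a backreference to the last matched RewriteCond CondPattern (ie. (.+\.pdf)).

The QSD (query string discard) flag, available in Apache 2.4+, stops the original file=... query string from being appended to the redirect target. On Apache 2.2, end the substitution with a ? instead (ie. /download/%1?).

The additional RewriteCond (condition) that checks against THE_REQUEST is necessary in order to prevent a redirect loop. (THE_REQUEST contains the first line of the original HTTP request, eg. GET /download.php?file=Document.pdf HTTP/1.1, and does not change when the URL is rewritten.)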

/download/Document.pdf would then be internally rewritten to the "real" URL, ie. /download.php?file=Document.pdf. An internal rewrite, as the name suggests, is internal to the server. There is no external HTTP request and the URL in the address bar does not change. It is completely hidden from the end user.

RewriteRule ^download/([^/]+\.pdf)$ download.php?file=$1 [L]


Note that there is no R (redirect) flag on this directive that would otherwise trigger an external redirect.
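Putting the pieces together, the root .htaccess file might look something like the following. This is only a sketch, assuming Apache 2.4+ (for the QSD flag), that mod_rewrite is enabled, and that the PDFs live in the document root. It uses 302 redirects while testing, as suggested above.

```apache
# Stop mod_negotiation resolving /download to download.php
Options -MultiViews

RewriteEngine On

# 1. Direct link to a physical PDF in the document root:
#    /Document.pdf -> /download/Document.pdf (external redirect)
RewriteCond %{REQUEST_FILENAME} -f
RewriteRule ^([^/]+\.pdf)$ /download/$1 [R=302,L]

# 2. Direct request for the script URL:
#    /download.php?file=Document.pdf -> /download/Document.pdf
#    THE_REQUEST is checked so the internal rewrite in (3) does not loop.
RewriteCond %{THE_REQUEST} ^GET\ /download\.php\?
RewriteCond %{QUERY_STRING} ^file=(.+\.pdf)$
RewriteRule ^download\.php$ /download/%1 [QSD,R=302,L]

# 3. Canonical URL, internally rewritten to the script (no R flag):
#    /download/Document.pdf -> download.php?file=Document.pdf
RewriteRule ^download/([^/]+\.pdf)$ download.php?file=$1 [L]
```

The order matters here: the two external redirects come before the internal rewrite, so a non-canonical URL is corrected before anything is passed to download.php.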

Ideally, you would make the regex as restrictive as possible. For example, in the above regex, .+ matches any characters (1 or more). However, if your filenames consist of only upper and lowercase letters then it would be preferable to change the regex to match only letters. eg. [a-zA-Z]+.
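For instance, if the filenames really do consist of letters only (an assumption; adjust the character class to suit your actual filenames), the query string condition and the internal rewrite could be tightened along these lines:

```apache
# Only accept letters-only basenames, eg. Document.pdf
RewriteCond %{THE_REQUEST} ^GET\ /download\.php\?
RewriteCond %{QUERY_STRING} ^file=([a-zA-Z]+\.pdf)$
RewriteRule ^download\.php$ /download/%1 [QSD,R=301,L]

RewriteRule ^download/([a-zA-Z]+\.pdf)$ download.php?file=$1 [L]
```

A request such as /download/some-file.pdf would then not match the last rule at all, so it never reaches download.php.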
