
How to keep friendly URLs consistent?

@Gail5422790

Posted in: #Htaccess #Url #UrlRewriting #WebCrawlers #WebDevelopment

I downloaded a site with a crawler script (HTTrack) and now have a few hundred HTML files that need to be edited and re-deployed.

The original site ran on a combination of Drupal and a lesser-known, proprietary CMS. All URLs were "clean" (no .html extensions) and ended with a trailing slash.

The URL structure of the downloaded files, however, is not consistent at all. Some URLs that used to end with a trailing slash, such as example.com/training/, were downloaded as example.com/training/index.html. That in itself is not a problem, because when redeployed, that URL will properly resolve to /training/, as long as I don't link to the index.html directly.

A large share of the URLs, however, were downloaded with a different naming scheme. For example, example.com/about-us/ was downloaded as example.com/about-us.html. I have no idea what caused this lack of consistency, and now I face a dilemma about how to redeploy the site. It seems my options are limited to the following:

1. The files that were downloaded as page/index.html can be uploaded as is. If I change all internal links with "Find and replace," those pages will function as before, with the trailing slash.

Downside:

- Confusing to maintain on a PC because of the large number of identical file names (index.html).

2. The URLs of files that were downloaded as page.html can be "cleaned up" with an .htaccess rule to remove .html.

Downsides:

- The URLs will lose the trailing slash.
- Directories and files won't be able to have the same name, e.g. example.com/technology and example.com/technology/methods.html, because that would break Apache.


Either way, I think it would be prudent to either have the trailing slash in every URL or not have it anywhere. What is the best way to keep these URLs consistent, and what are some of the ways to avoid the downsides of each method described above?


1 Comment


@Heady270

Just keep it like that, clean.

It's quite easy to remove index.html from a URL with mod_rewrite. Let's say we want to redirect example.com/index.html to example.com/:
RewriteEngine On
RewriteRule ^index\.html$ / [R=301,L]


If you're not familiar with .htaccess syntax, the RewriteRule directive has three parts: a pattern (^index\.html$), a substitution (/) and, optionally, some modifiers ([R=301,L]).
In the pattern, the symbol ^ means "starts with" and the symbol $ means "ends with". The backslash is the escape character, and we need to put it in front of the dot, because an unescaped dot matches any character and we don't want that here. So in this case the pattern will only match the string "index.html".

If the pattern matches (that is, if the request is for index.html), the request is redirected to "/", which is the root of your website.
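One caveat, assuming index.html is also the directory index (Apache's default DirectoryIndex): a request for / is mapped internally to index.html, which can make the rule match again and send the browser into a redirect loop. A common guard, sketched here as a suggestion rather than a guaranteed fix, is to test %{THE_REQUEST}, which holds the request line exactly as the browser sent it:

```apache
RewriteEngine On

# Assumption: this .htaccess file sits in the document root.
# THE_REQUEST is the raw request line, e.g. "GET /index.html HTTP/1.1".
# It is not altered by the internal DirectoryIndex lookup, so the redirect
# fires only when the visitor literally asked for /index.html.
RewriteCond %{THE_REQUEST} \s/+index\.html[\s?]
RewriteRule ^index\.html$ / [R=301,L]
```

With the condition in place, a plain request for / is served normally, while a typed or linked /index.html still gets the 301.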

And if you want to always remove index.html? For example, example.com/music/index.html -> example.com/music/. Easy!

RewriteEngine On
RewriteRule ^index\.html$ / [R=301,L]
RewriteRule ^(.*)/index\.html$ /$1/ [R=301,L]


The second rewrite rule checks for any request that ends with /index.html and removes the index.html bit. A quick explanation of the second rule: the parentheses in (.*) capture everything that comes before /index.html, and $1 in the substitution inserts that captured text, so music/index.html becomes /music/.
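For the pages that HTTrack saved as page.html, a similar set of rules may keep the trailing-slash URLs without renaming any files. This is only a sketch under assumptions the question does not state: the .htaccess file sits in the document root, mod_rewrite is enabled, and the rules have been tried on a test copy first. It redirects direct requests for /about-us.html to /about-us/, adds a missing trailing slash, and then serves about-us.html internally for /about-us/:

```apache
RewriteEngine On

# 1. A visitor asked for /page.html directly: redirect to /page/.
#    index.html is excluded so that directory indexes are left alone.
#    THE_REQUEST is the raw request line, so the internal rewrite in
#    step 3 cannot trigger this redirect again.
RewriteCond %{THE_REQUEST} !/index\.html[\s?] [NC]
RewriteCond %{THE_REQUEST} \s/+([^?\s]+)\.html[\s?] [NC]
RewriteRule ^ /%1/ [R=301,L]

# 2. /page without the slash: add it, but only if page.html exists.
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.+[^/])$ /$1/ [R=301,L]

# 3. /page/ was requested: serve page.html internally (the URL stays /page/).
#    Real directories are skipped, so /training/ still uses its index.html.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.+)/$ /$1.html [L]
```

Because step 3 skips real directories, a directory and a page with the same name resolve in favor of the directory. Note also that a page served at /about-us/ resolves its relative links against /about-us/ rather than /, so the internal links in the downloaded files may need to be made root-relative (for example /training/) before redeploying.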
