How to keep friendly URLs consistent?

Posted in: #Htaccess #Url #UrlRewriting #WebCrawlers #WebDevelopment

I downloaded a site with a crawler script (HTTrack) and now have a few hundred HTML files that need to be edited and re-deployed.

The original site ran on a combination of Drupal and a lesser known, proprietary CMS. All URLs were "clean" (no .html extensions) and ended with a trailing slash.

The URL structure of the downloaded files, however, is not consistent at all. Some URLs that used to end with a trailing slash, for example, example.com/training/ were downloaded as example.com/training/index.html. That in itself is not a problem, because when redeployed, that URL will properly resolve to /training/, as long as I don't link to the index.html directly.

Many of the URLs, however, were downloaded with a different naming scheme. For example, example.com/about-us/ was downloaded as example.com/about-us.html. I have no idea what caused this inconsistency, and now I face a dilemma about how to redeploy the site. It seems my options are limited to the following:

1. The files that were downloaded as page/index.html can be uploaded as is. If I change all internal links with "Find and replace," those pages will function as before, with the trailing slash.

Downside:

- Confusing to maintain on a PC because of the large number of identical file names (index.html).

2. The URLs of files that were downloaded as page.html can be "cleaned up" with an .htaccess rule to remove .html.

Downsides:

- The URLs will lose the trailing slash.
- Directories and files won't be able to have the same name, e.g. example.com/technology and example.com/technology/methods.html, because that would break Apache.

Either way, I think it would be prudent to either have the trailing slash in every URL or not have it anywhere. What is the best way to keep these URLs consistent, and what are some of the ways to avoid the downsides of each method described above?


1 Comments


@Heady270

Just keep it like that, clean.

It's quite easy to remove index.html from a URL with mod_rewrite. Let's say we want to redirect example.com/index.html to example.com/:

RewriteEngine On
RewriteRule ^index\.html$ / [R=301,L]

If you're not familiar with .htaccess syntax, the RewriteRule directive has three parts: a pattern (^index\.html$), a substitution (/), and optionally some flags ([R=301,L]). Here R=301 sends a permanent redirect and L stops processing further rules for this request.
In the pattern, the symbol ^ means "starts with" and the symbol $ means "ends with". The backslash is the escape character, and we need to put it in front of the dot, because an unescaped dot matches any character, and we don't want that here. So in this case the pattern will only match the string "index.html".

If the pattern is found (that is, if the request is to index.html), it will be redirected to "/", which is the root of your website.
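One caveat worth sketching: on some setups the rule above can cause a redirect loop, because a request for / is served internally from index.html and that internal lookup may match the rule again. A commonly used guard is to test %{THE_REQUEST}, which holds the request line the browser originally sent. The fragment below is a minimal sketch, assuming mod_rewrite is enabled and the .htaccess file sits in the document root; it has not been tested against your server.

```apache
RewriteEngine On

# Only redirect when the client itself asked for /index.html.
# THE_REQUEST looks like "GET /index.html HTTP/1.1" and does not change
# when Apache internally maps / to index.html.
RewriteCond %{THE_REQUEST} \s/+index\.html[\s?]
RewriteRule ^index\.html$ / [R=301,L]
```

While trying this out, it may help to use R=302 first, since browsers cache 301 redirects aggressively.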

And if you want to always remove index.html? For example, example.com/music/index.html -> example.com/music/. Easy!

RewriteEngine On
RewriteRule ^index\.html$ / [R=301,L]
RewriteRule ^(.*)/index\.html$ /$1/ [R=301,L]

The second rewrite rule checks for any request that ends with /index.html and removes the index.html part. The parentheses in the pattern capture everything before /index.html, and $1 in the substitution puts that captured path back, followed by a trailing slash.
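For the files that were downloaded as page.html, a similar approach might keep the trailing slash instead of losing it: redirect /page.html to /page/ and then quietly serve page.html when /page/ is requested. This is only a sketch under assumptions, not a drop-in solution: it assumes mod_rewrite is enabled, the .htaccess file lives in the document root, and no real directory shares a name with a .html file.

```apache
RewriteEngine On

# 1. Client asked for /something.html directly: redirect to /something/.
#    index.html is excluded so that directory index files are left alone.
RewriteCond %{THE_REQUEST} \s/+.+\.html[\s?]
RewriteCond %{REQUEST_URI} !/index\.html$
RewriteRule ^(.+)\.html$ /$1/ [R=301,L]

# 2. Client asked for /something/: if there is no such directory but
#    something.html exists, serve that file without changing the URL.
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{DOCUMENT_ROOT}/$1.html -f
RewriteRule ^(.+)/$ /$1.html [L]
```

With these rules a request for example.com/about-us.html should be redirected to example.com/about-us/, which is then served from about-us.html. If a directory named technology/ and a file named technology.html both exist, the directory wins for /technology/ and the file is no longer reachable under a clean URL, so in that case it is probably simpler to move the file to technology/index.html.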
