Can file encoding change when FTP is used?
I'm developing a web application which downloads a number of web pages using PHP curl. It then uses diff to compare the files as they change each day.
I reported a problem a few weeks back where seemingly identical files were being flagged by diff as being different: stackoverflow.com/questions/42552239/different-versions-of-diff-giving-mixed-results-when-comparing-2-identical-fil
The answer to the above was that if diff was used with the -w flag it ignores whitespace.
However, I've now noticed a separate problem. If I download one of the files I'm comparing, and re-upload (overwrite) it through an FTP client, the output changes.
For example, comparing file1.html against file2.html with diff file1.html file2.html may give output such as:
12159,12161c12159,12161
<
<
<
---
>
>
>
12163,12172c12163,12172
<
<
<
<
<
<
<
<
<
<
---
However, if I download file2.html to my desktop and re-upload it through FTP, diff without the -w flag reports there being no differences at all i.e. it's now saying the files are identical.
I've tried to check the encoding of the file using file -bi file2.html, but it reports the same result before and after the upload through FTP: text/html; charset=us-ascii
If the encoding is no different and the file contents have not been modified, how is re-uploading the file through FTP changing anything? I've tried it using FileZilla and also through NetBeans.
I'm using macOS Sierra locally and the remote server is Apache 2/PHP 7/CentOS.
Yes. Don't transfer UTF-16 files in ASCII mode; use binary mode to avoid data corruption here.
FTP's transformation of \r\n to \n will corrupt the remainder of the file if it happens to contain the single character ഊ or the sequence ㄍਰ or many others of the same class.
Please note that this is not an intelligent transformation, and the reverse transformation also exists and covers quite a few more cases.
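As a rough illustration of the corruption described above, here is a minimal Python sketch. It assumes the text is stored as UTF-16 big-endian and models the ASCII-mode rewrite as a plain byte replacement; the function name ascii_mode_crlf_to_lf is hypothetical, and real FTP clients and servers may apply the conversion differently.

```python
# Hypothetical model of an ASCII-mode transfer: every CR LF byte pair
# is rewritten to a single LF, with no knowledge of the text encoding.
def ascii_mode_crlf_to_lf(data: bytes) -> bytes:
    return data.replace(b"\r\n", b"\n")

# U+0D0A is one character whose UTF-16BE bytes are 0D 0A.
single = "\u0d0a".encode("utf-16-be")
# U+310D followed by U+0A30 encodes to 31 0D 0A 30, so the
# CR LF pair straddles the boundary between two characters.
pair = "\u310d\u0a30".encode("utf-16-be")

damaged_single = ascii_mode_crlf_to_lf(single)
damaged_pair = ascii_mode_crlf_to_lf(pair)

print(single.hex(), "->", damaged_single.hex())
print(pair.hex(), "->", damaged_pair.hex())
```

In both cases one byte is dropped, so everything after that point is shifted by one byte and no longer lines up on 16-bit boundaries.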
The answer I've accepted on this is correct; however, I'm adding some notes about how I used this information to work out what was going on.
The question uses file1.html and file2.html for simplicity. In reality file1.html represents a "master" copy of a web page downloaded in the past. file2.html is the most recent download of the web page content. The intention of the application is to compare the master copy of the file, against the latest version (diff file1.html file2.html).
There are hundreds of these files in the real application.
What I'd done when creating the master files was to download a set of file1.html equivalents onto my desktop. I then re-uploaded them via FileZilla into a "master" directory on the server. This was done some time ago (about 1st March) and I hadn't thought of this until realising what had happened.
Uploading through FileZilla from a Mac has introduced changes in the new-line character, as described in the accepted answer. Specifically it's using \r whereas on the Linux (CentOS) web server, it's using \n.
So when I ran diff file1.html file2.html it said the files were different. This is because at this point there are 2 different new-line characters between the files: \r in file1.html and \n in file2.html.
What I was then doing was downloading file2.html and re-uploading (overwriting) it from my desktop to the server with FileZilla. This is the point where I posted the original question.
As the answer suggests, there is a character encoding difference until file2.html is uploaded. At this point the files become the same because they have both gone through the same process. From my perspective it was as if merely uploading file2.html "fixed" the problem, but I couldn't understand why.
I could determine which files were using which new line characters using stackoverflow.com/questions/3569997/view-line-endings-in-a-text-file
My solution to the whole problem is to not download anything through FTP, as this was merely being used to examine the files, and rather use cp on the server when creating the "master" directory of files. Because this is all done on Linux there are no differences in the new-line characters: every copy of each file always uses \n.
Basically, the new-line character has to be the same between files, otherwise diff will flag them as being different. The output diff gives is not helpful in showing you what that difference is, since these characters are "invisible" when the file is displayed (unless you use something like the methods in the link above to show them in an editor).
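One way to make those invisible characters visible is to count them in the raw bytes. The following is a minimal Python sketch, not the method from the linked question; the function name count_line_endings and the sample bytes are made up for illustration.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone CR bytes and lone LF bytes in raw file data."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    return {"CRLF": crlf, "CR": cr, "LF": lf}

# Illustrative data containing one of each style plus a final LF.
sample = b"one\r\ntwo\nthree\rfour\n"
print(count_line_endings(sample))
```

For a real file, the bytes would be read in binary mode, for example open(path, "rb").read(), so that Python does not translate the line endings itself.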
Yes, FTP does some encoding changes. Data is transferred from a storage device in the sending Host to a storage device in the receiving Host. Often it is necessary to perform certain transformations on the data because data storage representations in the two systems are different. For example, NVT-ASCII has different data storage representations in different systems. PDP-10's generally store NVT-ASCII as five 7-bit ASCII characters, left-justified in a 36-bit word. 360's store NVT-ASCII as 8-bit EBCDIC codes. Multics stores NVT-ASCII as four 9-bit characters in a 36-bit word. It may be desirable to convert characters into the standard NVT-ASCII representation when transmitting text between dissimilar systems. The sending and receiving sites would have to perform the necessary transformations between the standard representation and their internal representations. (Please refer to the "Data Representation and Storage" section of RFC 765 for further detail.)
You are probably seeing a difference in line-endings. When transferring a file in ASCII/Text mode (as opposed to "Binary mode"), most FTP clients will convert/normalise line-endings to the OS being transferred to.
On Classic Mac OS (9.x and earlier) the line-ending char is simply \r (ASCII 13), on Mac OS X this changed to \n (ASCII 10), on Linux it is \n (ASCII 10). And Windows is \r\n or ASCII 13+10. (Thanks @8bittree for the Mac correction.)
So, when downloading from one OS to another all line-endings are silently converted. The conversion is reversed when uploaded. (However, as noted in @Joshua's answer this can result in corruption, depending on the file's character encoding and specific characters contained in the file.) If there is a mishmash of line-endings then it's possible the FTP software is normalising/fixing the line-endings. This would explain why downloading and then uploading the file results in a "different" file to what was originally on the server (i.e. it is "fixed"). Or perhaps it is reverting a previously mis-converted file. However, the EOL-conversion may not be so intelligent and you can just end up with either double spaced lines or missing line breaks altogether (i.e. mildly corrupted).
By default, most FTP clients are set to "Auto" transfer mode and have a list of known file types to transfer in ASCII/Text mode. Other file types are transferred in "Binary" mode. If you are transferring between the same OS, or you wish to transfer with no conversion, then you should use "Binary" mode only.
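If the transfers are scripted rather than done through a GUI client, binary mode can be requested explicitly. The sketch below uses Python's standard ftplib; the helper names upload_binary and download_binary, and the host, credentials and file names in the comment, are placeholders rather than anything from the original setup.

```python
from ftplib import FTP

def upload_binary(ftp: FTP, local_path: str, remote_name: str) -> None:
    """Upload a file byte-for-byte (binary mode, no line-ending conversion)."""
    with open(local_path, "rb") as fp:
        ftp.storbinary(f"STOR {remote_name}", fp)

def download_binary(ftp: FTP, remote_name: str, local_path: str) -> None:
    """Download a file byte-for-byte (binary mode, no line-ending conversion)."""
    with open(local_path, "wb") as fp:
        ftp.retrbinary(f"RETR {remote_name}", fp.write)

# Example usage (placeholder host and credentials, not run here):
# with FTP("ftp.example.com") as ftp:
#     ftp.login("user", "password")
#     download_binary(ftp, "file2.html", "file2.html")
```

In FileZilla the equivalent is, as far as I know, setting the transfer type to "Binary" instead of "Auto" in the Transfer menu.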
Ordinarily, the FTP software will not change the character encoding of the transferred file unless the source/target operating systems use a very different character encoding with which to represent text files. As @KeithDavies noted in comments, one such example is when downloading from a mainframe that uses EBCDIC to a local Windows machine. EBCDIC is not supported natively by Windows, so a conversion to ASCII is required. Again, transferring in "Binary mode" avoids any such conversion. (Thanks to @KeithDavies for the note regarding character encoding.)
"The answer to the above was that if diff was used with the -w flag it ignores whitespace."
Yes, line-endings (whitespace) are ignored in the comparison.
"If I download one of the files I'm comparing, and re-upload (overwrite) it through an FTP client, the output changes."
If there was a mixture of line-endings in the original file then downloading and re-uploading in ASCII mode could well "fix" the inconsistent line-endings. So, the files are now "the same".
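The effect can be sketched in a few lines of Python. This is only a model of what an ASCII-mode round trip might do, assuming it ends up rewriting every line ending to \n; the function name normalise_to_lf and the sample contents are invented for illustration.

```python
def normalise_to_lf(data: bytes) -> bytes:
    """Rewrite CRLF and lone CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

# Same visible content, but the "master" copy has mixed line endings.
master = b"<p>one</p>\r\n<p>two</p>\r<p>three</p>\n"
latest = b"<p>one</p>\n<p>two</p>\n<p>three</p>\n"

print(master == latest)                                    # False
print(normalise_to_lf(master) == normalise_to_lf(latest))  # True
```

A byte-for-byte comparison sees two different files until both have been through the same normalisation, which matches the behaviour reported in the question.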