What is the difference between hex bytes output using different types of encoding schemes in C#?
Consider the following C# code:

    int x = 126;
    string s = "126";
    FileStream fs = new FileStream("test.txt", FileMode.Create);
    StreamWriter sw = new StreamWriter(fs);
    sw.WriteLine(x);
    sw.WriteLine(s);
    sw.Close(); // flush the buffered text to test.txt and release the file
The output (the hex bytes stored in test.txt) is: 31 32 36 0D 0A 31 32 36 0D 0A

If I change line 4 (the StreamWriter line) to:

    StreamWriter sw = new StreamWriter(fs, Encoding.Unicode);

the output is: FF FE 31 00 32 00 36 00 0D 00 0A 00 31 00 32 00 36 00 0D 00 0A 00

Could someone explain the logic to me? Is there a reference on the different encoding schemes and on how files behave when written from C#?
I suggest you read Joel Spolsky's excellent article on the subject of character sets and encodings. In short:
- A file is a sequence of bytes.
- A string is a sequence of characters.
- A character set defines a collection of characters and assigns a unique code point to each character (a code point is an integer that represents the character; note "integer", not int).
- When you want to store a string in a file, you need to convert the character sequence to a byte sequence. For character sets with 256 characters or fewer, there is a one-to-one correspondence between characters and bytes; for bigger character sets, such as Unicode, it gets more complicated.
- An encoding defines how the code points of the characters of a string should be translated into bytes.

Therefore, when you change the encoding, the same string gets translated into a different sequence of bytes.
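To make that concrete, here is a minimal sketch (not part of the original answer) that converts the string from the question to bytes with several of the encodings built into .NET. It is written as top-level statements, which assumes C# 9 or later, and the helper name `Hex` is just an illustrative choice.

```csharp
using System;
using System.Text;

// Illustrative helper: format a byte array as space-separated hex, e.g. "31 32 36".
string Hex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", " ");

// The same three characters, translated to bytes by four different encodings.
string s = "126";
string ascii = Hex(Encoding.ASCII.GetBytes(s));
string utf8  = Hex(Encoding.UTF8.GetBytes(s));
string utf16 = Hex(Encoding.Unicode.GetBytes(s)); // UTF-16, little-endian
string utf32 = Hex(Encoding.UTF32.GetBytes(s));   // UTF-32, little-endian

Console.WriteLine($"ASCII:  {ascii}"); // 31 32 36
Console.WriteLine($"UTF-8:  {utf8}");  // 31 32 36
Console.WriteLine($"UTF-16: {utf16}"); // 31 00 32 00 36 00
Console.WriteLine($"UTF-32: {utf32}"); // 31 00 00 00 32 00 00 00 36 00 00 00
```

ASCII and UTF-8 agree here only because the digits fall in the range where the two encodings coincide; the wider encodings pad each character to two or four bytes.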
Note that the behavior of character sets and encodings is independent of the programming language. What changes is how you refer to and use the various encodings and character sets (usually an encoding is tied to a particular character set, so by selecting the encoding you implicitly select the character set). In C#'s case, Encoding.Unicode is poorly named: Unicode is the character set, while the encoding it actually selects is UTF-16LE (in which every second byte is 00 if you use only English characters).
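The following sketch (again an illustration, not the original poster's code) reproduces the second experiment from the question in memory rather than in test.txt. It assumes C# 9 top-level statements; `Hex` is an illustrative helper name, and the line ending is pinned to CR LF so the result does not depend on the operating system.

```csharp
using System;
using System.IO;
using System.Text;

// Illustrative helper: format a byte array as space-separated hex.
string Hex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", " ");

// Encoding.Unicode is UTF-16 little-endian; its preamble is the byte order mark.
string bom = Hex(Encoding.Unicode.GetPreamble());                    // FF FE
string littleEndian = Hex(Encoding.Unicode.GetBytes("126"));          // 31 00 32 00 36 00
string bigEndian = Hex(Encoding.BigEndianUnicode.GetBytes("126"));    // 00 31 00 32 00 36

// Write one line through a StreamWriter, as in the question, but into memory.
byte[] written;
using (var ms = new MemoryStream())
{
    using (var sw = new StreamWriter(ms, Encoding.Unicode))
    {
        sw.NewLine = "\r\n"; // pin CR LF so the bytes are the same on every OS
        sw.WriteLine(126);
    }
    written = ms.ToArray(); // ToArray still works after the stream is closed
}
string file = Hex(written);

Console.WriteLine(bom);
Console.WriteLine(littleEndian);
Console.WriteLine(bigEndian);
Console.WriteLine(file); // FF FE 31 00 32 00 36 00 0D 00 0A 00
```

The leading FF FE in the question's output is therefore the UTF-16LE byte order mark that StreamWriter emits at the start of the stream, and the remaining bytes are the same characters as before, each stored in two bytes with the low byte first.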
Also, note that strings are represented internally as char arrays, and each char value represents 2 consecutive bytes of the UTF-16 encoding (so some fancy characters might be represented by 2 char values). You can't access that array directly, and most of the string functionality tries to abstract this fact away. The internal encoding doesn't affect how strings are written to files: either you select the encoding manually, or you get the default of the operation you're invoking, which for StreamWriter is UTF-8 (thanks @xanatos for the correction).
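Both points can be checked with a short sketch, assuming C# 9 top-level statements; the sample characters (U+1F600 and U+00E9) are arbitrary choices for illustration.

```csharp
using System;
using System.IO;
using System.Text;

// U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs two char values for it.
string smiley = "\U0001F600";
int charCount = smiley.Length;                    // number of char values, not characters
bool isPair = char.IsSurrogatePair(smiley, 0);    // the two chars form a surrogate pair
int codePoint = char.ConvertToUtf32(smiley, 0);   // the single code point they encode

Console.WriteLine(charCount);                     // 2
Console.WriteLine(isPair);                        // True
Console.WriteLine(codePoint.ToString("X"));       // 1F600

// A StreamWriter created without an explicit encoding writes UTF-8 (with no byte order mark),
// regardless of the UTF-16 representation used in memory.
byte[] written;
using (var ms = new MemoryStream())
{
    using (var sw = new StreamWriter(ms))
    {
        sw.Write("\u00E9"); // 'é': one char in memory, two bytes in UTF-8
    }
    written = ms.ToArray();
}
string hex = BitConverter.ToString(written).Replace("-", " ");
Console.WriteLine(hex);                           // C3 A9
```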