Important: Shift-JIS to UTF-8

Post by clay » Wed 01.09.2008 9:11 pm

Our Australian friends (who are working on moving the data from the current site and converting it to Drupal and phpBB format) ran into a 文字化け (mojibake, i.e. garbled characters) problem. The encoding here is the old Shift-JIS encoding and the new format will be UTF-8. About half of the migrated test data came through fine and the other half was garbled.

Are there any geniuses here who know how to convert Shift-JIS data to UTF-8? If someone has done this before, I would be willing to pay for his or her time.

Or if someone knows of a better character set for encoding Japanese, please let me know.

Please reply here or email me directly at clay AT thejapanesepage.com.

Once we get this figured out, it looks like we can make the move!

Many thanks!
TheJapanShop.com- Japanese language learning materials
Check out our iPhone apps: TheJapanesePage.com/iPhone
clay
Site Admin
 
Posts: 2809
Joined: Fri 01.21.2005 9:39 am
Location: Florida

RE: Important: Shift-JIS to UTF-8

Post by chikara » Wed 01.09.2008 9:32 pm

I have never done this myself but here are some examples of how it can be done. Do your Aussies have any coding knowledge?

Python
http://www.thescripts.com/forum/thread45142.html

C++
http://www.example-code.com/vcpp/utf-8-shift-jis.asp

Java

Something along the lines of:

import java.io.*;

public class SjisToUtf8 {
    public static void main(String[] args) throws IOException {
        // For each file to be converted.
        // "SJIS" is Java's legacy name for Shift_JIS; data written on Windows may need "MS932" instead.
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(new FileInputStream("c:\\temp\\old.txt"), "SJIS"));
             BufferedWriter writer = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream("c:\\temp\\new.txt"), "UTF-8"))) {
            // readLine() strips the line terminator, so newLine() writes one back.
            String line = reader.readLine();
            while (line != null) {
                writer.write(line);
                writer.newLine();
                line = reader.readLine();
            }
        } // reader and writer are closed automatically here
    }
}
Last edited by chikara on Wed 01.09.2008 9:35 pm, edited 1 time in total.
Don't complain to me that people kick you when you're down. It's your own fault for lying there
chikara
 
Posts: 3576
Joined: Tue 07.11.2006 10:48 pm
Location: Australia (SA)
Native language: English (Australian)
Gender: Male

RE: Important: Shift-JIS to UTF-8

Post by Harisenbon » Wed 01.09.2008 9:52 pm

PHP also has some built-in functions for it, but its multibyte (Japanese) support is severely lacking.

If worst comes to worst, you can just run it through a text editor that can convert Shift-JIS to UTF-8 and have it batch-convert all the files (EditPlus can do this).
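
If no such editor is at hand, the same batch conversion can be scripted. Here is a minimal sketch in Python 3, not a tested migration tool. The function names, the *.txt pattern, and the choice of cp932 (Microsoft's superset of Shift-JIS) as the source encoding are all assumptions for illustration; plain shift_jis may fit your data better.

```python
from pathlib import Path


def convert_file(src, dst, src_encoding="cp932"):
    """Decode one file from src_encoding and write it back out as UTF-8."""
    raw = Path(src).read_bytes()
    # Strict decoding: a UnicodeDecodeError here means the file is not
    # really in src_encoding, which is safer than silently writing garbage.
    text = raw.decode(src_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


def convert_tree(src_dir, dst_dir, pattern="*.txt", src_encoding="cp932"):
    """Convert every matching file under src_dir into dst_dir.

    Returns the list of source files that could not be decoded.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    failed = []
    for src in sorted(src_dir.rglob(pattern)):
        if not src.is_file():
            continue
        # Mirror the source layout under the destination directory.
        dst = dst_dir / src.relative_to(src_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        try:
            convert_file(src, dst, src_encoding)
        except UnicodeDecodeError:
            failed.append(src)
    return failed
```

Calling something like convert_tree("old_site", "new_site") (both directory names hypothetical) leaves the originals untouched, and any file listed in the return value needs a closer look by hand.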
Want to learn Japanese the right way? How about for free?
Ippatsu // Japanesetesting.com
Harisenbon
 
Posts: 2964
Joined: Tue 06.14.2005 3:24 am
Location: Gifu, Japan
Native language: (poor) English

RE: Important: Shift-JIS to UTF-8

Post by clay » Wed 01.09.2008 9:53 pm

I saw the top site earlier, but it was over my head. I will tell them to follow this thread for any good links/advice. Maybe it will mean something to them.

EDIT: Thanks both of you! I'll pass on that info.

EDIT, EDIT: Am I correct in assuming UTF-8 is the best choice for the final encoding?
Last edited by clay on Wed 01.09.2008 9:59 pm, edited 1 time in total.

RE: Important: Shift-JIS to UTF-8

Post by Wakannai » Thu 01.10.2008 1:00 am

I think UTF-8 is better. Shift-JIS has some major issues. Also, as I understand it, most of the world is moving in that direction.
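
To make one such issue concrete, here is a small Python 3 sketch (the function name and sample characters are picked for this example). In Shift-JIS the second byte of some characters is 0x5C, the ASCII backslash, so byte-oriented code that treats backslashes specially can corrupt the text. In UTF-8 every byte of a multibyte character is 0x80 or above, so the clash cannot occur.

```python
BACKSLASH = 0x5C  # ASCII code of "\"


def chars_containing_backslash_byte(text, encoding):
    """Return the characters whose multibyte encoding contains byte 0x5C."""
    hits = []
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) > 1 and BACKSLASH in encoded:
            hits.append(ch)
    return hits


sample = "ソ表あ"  # illustrative sample, not taken from the site's data
print(chars_containing_backslash_byte(sample, "shift_jis"))  # ['ソ', '表']
print(chars_containing_backslash_byte(sample, "utf-8"))      # []
```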
Wakannai
 
Posts: 658
Joined: Thu 10.18.2007 6:38 am

RE: Important: Shift-JIS to UTF-8

Post by Harisenbon » Thu 01.10.2008 1:20 am

The only problem with UTF-8 is that not all cellphone carriers in Japan support it (Shift-JIS is still the de facto standard). However, international pages are better served in UTF-8, in my personal opinion. Whatever you do, stay away from the blasphemy-unto-god that is EUC. For some reason tons of Japanese programmers use EUC, but it's not readily supported by most file editors.

RE: Important: Shift-JIS to UTF-8

Post by chikara » Thu 01.10.2008 2:30 am

UTF-8 is part of the Unicode standard; Shift-JIS is not.

RE: Important: Shift-JIS to UTF-8

Post by Hark » Thu 01.10.2008 6:13 am

What do you mean by half fine and half messed up? Is the older data stored in a different way than the newer data (but everything works nevertheless)? Or do you just mean that some characters are OK and some are not (but consistently across the whole DB)?

What about plain old recode?
http://linux.die.net/man/1/recode
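
One way to answer the "half fine, half messed up" question is to look at each record's raw bytes. A rough Python 3 sketch, assuming the data is either UTF-8 or Shift-JIS (cp932, Microsoft's variant, is used here); the function name is made up for this example. It is only a heuristic: some byte sequences are valid in both encodings, so a result is a strong hint rather than proof.

```python
def guess_encoding(raw, candidates=("utf-8", "cp932")):
    """Return 'ascii', the first candidate that decodes raw strictly, or None."""
    if all(b < 0x80 for b in raw):
        return "ascii"  # pure ASCII bytes decode identically under both codecs
    # UTF-8 goes first because it is the stricter of the two formats.
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


# Illustrative records; real ones would come from the old site's data.
records = ["日本語".encode("utf-8"), "日本語".encode("cp932"), b"hello"]
for raw in records:
    print(guess_encoding(raw))
```

If the records split cleanly into two groups, that points to data stored in two different encodings rather than a fault in the conversion itself.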


Bonus story:
I used the old mysql extension in PHP, and to enable diacritics I experimented and ended up with this setting (I didn't know what a mess I'd created; I just changed a parameter here and there and suddenly it worked). The data was actually stored in UTF-8, but what was encoded was not the text directly but its CP1252 code. And to kill it totally, when decoded, the text was in CP1250. When I tried to make a new version I found that everything could be recoded except the ť and Ť characters, which were stored in a totally alien encoding. The strange thing is that it all worked in the browser. I don't know how or why. Now I strictly use UTF-8 everywhere.
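
One possible explanation for the ť and Ť problem, offered as a guess rather than a diagnosis of the setup above: in CP1250 those two letters sit at bytes 0x9D and 0x8D, which CP1252 leaves undefined, so a detour through CP1252 has nothing to map them to. A small Python 3 sketch of that check (the function name is made up for this example):

```python
def survives_cp1252_detour(ch):
    """True if ch's CP1250 byte can be read as CP1252 and mapped back intact."""
    raw = ch.encode("cp1250")
    try:
        detour = raw.decode("cp1252")  # misread the CP1250 byte as CP1252
    except UnicodeDecodeError:
        return False  # CP1252 has no character at this byte value
    # Undo the misreading and check that the original character comes back.
    return detour.encode("cp1252").decode("cp1250") == ch


for ch in "áčšžťŤ":
    print(ch, survives_cp1252_detour(ch))
```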
Hark
 
Posts: 73
Joined: Sat 03.11.2006 6:30 am
Location: Bratislava
Native language: Slovenčina (Slovak)
Gender: Male

RE: Important: Shift-JIS to UTF-8

Post by clay » Thu 01.10.2008 6:54 am

Thanks everyone!

Building on what Harisenbon said, here is a tip from a friend's friend:
The Multibyte String library (mbstring) that comes with PHP has a function called mb_convert_encoding that does this:
http://www.php.net/manual/en/function.m ... coding.php
Normally this library is not loaded by default in PHP 4.x; he just needs to include it.

He can also use the jcode.php library from http://www.spencernetwork.org/jcode/ (but the mbstring library is more complete).

RE: Important: Shift-JIS to UTF-8

Post by arbalest71 » Fri 01.18.2008 7:22 am

iconv should also work under Linux/Unix, and is probably the simplest way to handle it if you have a Linux system available. 'iconv -l | grep JIS' will tell you whether your version of iconv knows Shift-JIS, and 'man iconv' will tell you how to use it. I would be a bit surprised to see a version that didn't recognize Shift-JIS at this late date. There's not much reason to pay someone to convert some files from Shift-JIS to UTF-8: you could boot Linux from a CD even if you are Windows-only, and I'd guess that iconv runs fine under Cygwin. If you have to pull a lot of fields out of a DB, convert them, and then recreate the DB, it's a bit more complicated, but that's not really an encoding problem.
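
For the database case, the shape of the job is roughly: read each field as raw bytes, decode it, and write it back as UTF-8. A minimal Python 3 sketch, using the standard-library sqlite3 module purely as a stand-in (the thread does not say which database the site uses), with hypothetical table and column names, and cp932 assumed as the source encoding:

```python
import sqlite3


def convert_column(conn, table, key_col, text_col, src_encoding="cp932"):
    """Re-encode one column from src_encoding bytes to UTF-8 text.

    Returns the keys of rows that could not be decoded. The table and
    column names are interpolated into the SQL, so they must be trusted.
    """
    rows = conn.execute(f"SELECT {key_col}, {text_col} FROM {table}").fetchall()
    failed = []
    for key, raw in rows:
        if not isinstance(raw, bytes):
            continue  # already stored as text (or NULL), leave it alone
        try:
            text = raw.decode(src_encoding)
        except UnicodeDecodeError:
            failed.append(key)
            continue
        conn.execute(
            f"UPDATE {table} SET {text_col} = ? WHERE {key_col} = ?", (text, key)
        )
    conn.commit()
    return failed


# Tiny in-memory demo with one Shift-JIS encoded row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, body BLOB)")
conn.execute("INSERT INTO posts VALUES (1, ?)", ("日本語".encode("cp932"),))
failed = convert_column(conn, "posts", "id", "body")
print(failed, conn.execute("SELECT body FROM posts WHERE id = 1").fetchone()[0])
```

Collecting the failures instead of stopping at the first one makes it easier to see whether the bad rows share a pattern, such as all coming from one date range.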
arbalest71
 
Posts: 142
Joined: Wed 10.11.2006 8:44 pm

