Monday, September 11, 2006

ServletRequest.getParameter and UTF-8

When we get a parameter from an HTTP request in the Tomcat Servlet container, the String object returned isn't decoded as UTF-8. If this bugs you, you can work around it:

String value = request.getParameter("key");
if (value != null) {
    try {
        // Re-interpret the bytes of the already-decoded value as UTF-8
        value = new String(value.getBytes(), "UTF-8");
    } catch (java.io.UnsupportedEncodingException uee) {
        // wrong encoding!
    }
}
I don't know whether this is behavior expected from the HttpServlet spec.

update: 09/13/2006: The above code doesn't quite work. I'm now pretty sure it was a useless attempt, heh. getParameter URL-decodes the value for you, but it doesn't do it in a UTF-8-aware way, and my "workaround" can't possibly work around that limitation. Duh.

Instead, call request.getQueryString() to get the raw, UTF-8 URL-encoded query string sent by the HTTP client. Then manually extract the value for the key you need and run it through java.net.URLDecoder.decode(theValue, "UTF-8").
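Here is a minimal sketch of that manual extraction. The class name QueryParams and the method getUtf8Parameter are made up for illustration, and the sketch assumes a plain key=value&key=value query string and returns only the first match. Note that URLDecoder.decode throws IllegalArgumentException on malformed percent-escapes, which this sketch does not handle.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of the Servlet API.
final class QueryParams {

    /**
     * Returns the UTF-8 decoded value of the first parameter named key
     * in the raw query string, or null if the key is absent.
     */
    static String getUtf8Parameter(String rawQuery, String key) {
        if (rawQuery == null) {
            return null; // request.getQueryString() returns null when there is no query
        }
        try {
            for (String pair : rawQuery.split("&")) {
                int eq = pair.indexOf('=');
                String rawName = eq < 0 ? pair : pair.substring(0, eq);
                String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
                // Decode the name too, since keys may also be percent-encoded
                if (URLDecoder.decode(rawName, "UTF-8").equals(key)) {
                    return URLDecoder.decode(rawValue, "UTF-8");
                }
            }
            return null;
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException(e);
        }
    }
}
```

Inside a servlet this would be called as QueryParams.getUtf8Parameter(request.getQueryString(), "key"). It only covers parameters sent in the URL, not ones posted in a form body.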

If you want to pass this value to an XSLT transformation as a parameter using Xalan, you'll run into another UTF-8-awareness limitation. Pass a UTF-8 URL-encoded version of the value to the transformation as $encodedValue. Then, inside the XSLT stylesheet, declare a variable like this: <xsl:variable name="decodedValue" select="java:java.net.URLDecoder.decode($encodedValue, 'UTF-8')"/> ... assuming you've enabled "java" as an extension by declaring its namespace (for Xalan, xmlns:java="http://xml.apache.org/xalan/java").
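The Java side of that hand-off might look roughly like the sketch below. The class name EncodedXsltParam, the method name, and the tiny stylesheet are all made up for illustration. The stylesheet here only echoes $encodedValue back, so the sketch runs without extension functions (which some XSLT processors disable by default). A real stylesheet would decode the parameter with the xsl:variable declaration described above.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical example class, for illustration only.
final class EncodedXsltParam {

    // Placeholder stylesheet: writes the received parameter out as plain text.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:param name=\"encodedValue\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:value-of select=\"$encodedValue\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** URL-encodes value as UTF-8, passes it as $encodedValue, returns the output. */
    static String transformWithEncodedParam(String value) {
        try {
            // The encoded form is pure ASCII, so it survives the trip into the stylesheet
            String encoded = URLEncoder.encode(value, "UTF-8");
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setParameter("encodedValue", encoded);
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader("<doc/>")),
                new StreamResult(out));
            return out.toString();
        } catch (UnsupportedEncodingException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the parameter crosses into the stylesheet as ASCII percent-escapes, nothing along the way has to guess at a character encoding.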

update: 09/26/2006 Upon reading this article on UTF-8 and request.getParameter from jGuru, a better approach appears to be:
// Must run before the first getParameter call on this request
if (request.getCharacterEncoding() == null) {
    request.setCharacterEncoding("UTF-8");
}
String paramValue = request.getParameter("paramKey");
Basically, the servlet engine needs to be told to retrieve parameters using UTF-8, as browsers don't always send accurate information as to what encoding is being used in a form submission.

update: 12/05/2006: An anonymous poster below points us to this discussion about setCharacterEncoding having no effect.

6 comments:

Anonymous said...

I think you'll find that Tomcat, for one, ignores setCharacterEncoding as far as interpreting request parameters goes.

See http://issues.apache.org/bugzilla/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__closed__&product=Tomcat+5&content=getparameter+setcharacterencoding for examples of the discussions.

joop_eggen said...

s.getBytes("Cp1252");

Explanation: your getBytes call uses the platform encoding of the server. One would like to use the W3C default, ISO-8859-1. But ALL browsers interpret ISO-8859-1 as Cp1252 (even on Mac). Hence it may be better to use Windows-1252, aka Cp1252.
- Joop Eggen

Anonymous said...

Just read:
http://wiki.apache.org/tomcat/Tomcat/UTF-8
and port your application to UTF-8 like everyone else

Anonymous said...

Thanks. It runs!
