Monday, September 11, 2006

ServletRequest.getParameter and UTF-8

When we get a parameter from an HTTP request in the Tomcat Servlet container, the String object returned isn't UTF-8-aware. If this bugs you, you can work around it:

String value = request.getParameter("key");
if (value != null) {
    try {
        // getBytes() with no argument uses the server's platform default encoding
        value = new String(value.getBytes(), "UTF-8");
    } catch (java.io.UnsupportedEncodingException uee) {
        // wrong encoding!
    }
}
I don't know whether this is behavior expected from the HttpServlet spec.

update: 09/13/2006: the above code doesn't quite work. I'm now pretty sure it was a useless attempt, heh. getParameter URLdecodes the value for you, but it doesn't do it in a UTF-8-aware way, and my "workaround" can't possibly work around that limitation. duh.
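
To see what a decode that isn't UTF-8-aware does to a value, here is a minimal sketch that decodes the same URL-encoded string twice with java.net.URLDecoder. It assumes the container behaves like an ISO-8859-1 decode, which is an assumption for illustration and not something verified against Tomcat here; the class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: compares two ways of decoding the same raw value.
class DecodeComparison {

    // Decodes a URL-encoded string using the named charset.
    static String decode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // "%C3%A9" is the two-byte UTF-8 encoding of the single character U+00E9
        String raw = "caf%C3%A9";

        // Assumed container behavior: each byte becomes its own character
        String latin1 = decode(raw, "ISO-8859-1");
        // UTF-8-aware decode: the two bytes become one character
        String utf8 = decode(raw, "UTF-8");

        System.out.println(latin1.length()); // 5
        System.out.println(utf8.length());   // 4
    }
}
```

Under that assumption, the string handed back by getParameter already contains two wrong characters where one was intended.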

You just want to call request.getQueryString() to get the raw, UTF-8 URL-encoded query string sent by the HTTP client. Then you manually extract the value for the key you need and run it through java.net.URLDecoder.decode(theValue,"UTF-8");
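
A rough sketch of that manual extraction follows. The class name QueryStringParams and the method getUtf8Parameter are hypothetical helpers, not part of the Servlet API, and the sketch only looks at the query string, so it won't see parameters sent in a POST body.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: pulls one parameter out of a raw query string.
class QueryStringParams {

    /**
     * Returns the first value for the given key in a raw (still URL-encoded)
     * query string, decoded as UTF-8, or null if the key is not present.
     */
    static String getUtf8Parameter(String rawQuery, String key) {
        if (rawQuery == null || key == null) {
            return null;
        }
        try {
            for (String pair : rawQuery.split("&")) {
                int eq = pair.indexOf('=');
                String rawName = eq >= 0 ? pair.substring(0, eq) : pair;
                String rawValue = eq >= 0 ? pair.substring(eq + 1) : "";
                // Decode the name too, since keys can also be URL-encoded
                if (key.equals(URLDecoder.decode(rawName, "UTF-8"))) {
                    return URLDecoder.decode(rawValue, "UTF-8");
                }
            }
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8
            throw new IllegalStateException("UTF-8 not supported", e);
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by request.getQueryString()
        String raw = "lang=fr&key=caf%C3%A9";
        System.out.println(getUtf8Parameter(raw, "key"));
    }
}
```

In a servlet, the call would look like QueryStringParams.getUtf8Parameter(request.getQueryString(), "key").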

If you want to pass this value as a parameter to an XSLT transformation using Xalan, you'll run into a UTF-8-awareness limitation there too. You'll want to pass a UTF-8 URL-encoded version of the value to the transformation as $encodedValue. Then, inside the XSLT stylesheet, declare a variable like this: <xsl:variable name="decodedValue" select="java:java.net.URLDecoder.decode($encodedValue, 'UTF-8')"/> ... assuming you've enabled "java" as an extension by declaring its namespace.

update: 09/26/2006 Upon reading this article on UTF-8 and request.getParameter from jGuru, a better approach appears to be:
if(request.getCharacterEncoding() == null)
request.setCharacterEncoding("UTF-8");
paramValue = request.getParameter("paramKey");
Basically, the servlet engine needs to be told to retrieve parameters using UTF-8, as browsers don't always send accurate information as to what encoding is being used in a form submission.

update: 12/05/2006: An anonymous poster below points us to this discussion about setCharacterEncoding having no effect.

4 comments:

Anonymous said...

I think you'll find that Tomcat, for one, ignores setCharacterEncoding as far as interpreting request parameters goes.

See http://issues.apache.org/bugzilla/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__closed__&product=Tomcat+5&content=getparameter+setcharacterencoding for examples of the discussions.

Unknown said...

s.getBytes("Cp1252");

Explanation: your getBytes uses the platform encoding of the server. One would like to use the W3C default, ISO-8859-1. But ALL browsers interpret ISO-8859-1 as Cp1252 (even on Mac). Hence it may be better to use Windows-1252, aka Cp1252.
- Joop Eggen

Anonymous said...

Just read:
http://wiki.apache.org/tomcat/Tomcat/UTF-8
and port your application to UTF-8 like everyone else

Anonymous said...

Thanks. It runs!