Monday, September 11, 2006

ServletRequest.getParameter and UTF-8

When we get a parameter from an HTTP request in the Tomcat Servlet container, the String object returned isn't UTF-8-aware. If this bogs you down, you can work around it:

String value = request.getParameter("key");
if (value != null) {
    try {
        value = new String(value.getBytes(), "UTF-8");
    } catch (UnsupportedEncodingException uee) {
        // wrong encoding!
    }
}
I don't know whether this behavior is what the Servlet spec expects.

update: 09/13/2006: the above code doesn't quite work. I'm now pretty sure it was a useless attempt, heh. getParameter URL-decodes the value for you, but it doesn't do it in a UTF-8-aware way, and my "workaround" can't possibly work around that limitation. duh.
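
To make the failure mode concrete, here is a minimal sketch of what a non-UTF-8-aware decode does to the bytes. It assumes the container decoded them as ISO-8859-1 (an assumption about the setup, not something verified here), and the class and method names are made up for illustration. Note that the no-arg getBytes() in the snippet above uses the server's platform encoding, so it only undoes the damage when that encoding happens to match the one the container used.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Simulates a container that turns the raw parameter bytes into a String
    // using ISO-8859-1 (assumed here; the real charset depends on the setup).
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Recovers the original bytes with the SAME charset the container used,
    // then decodes them as UTF-8. Only valid under the assumption above.
    static String reinterpretAsUtf8(String misdecoded) {
        byte[] original = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 bytes a client would send for "café" after URL-decoding.
        byte[] raw = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String misdecoded = decodeAsLatin1(raw);
        String fixed = reinterpretAsUtf8(misdecoded);

        System.out.println(misdecoded.length()); // 5: the two-byte "é" became two chars
        System.out.println(fixed.length());      // 4: back to a single "é"
    }
}
```

ISO-8859-1 maps every byte to exactly one character, which is why the round trip is lossless in this sketch. With a different container charset, or with the platform default, the same trick may silently corrupt the value.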

You just want to call request.getQueryString() to get the raw UTF-8 URL-encoded query string sent by the HTTP client. And then you want to manually extract the value you need by its key, and run it through URLDecoder.decode(value, "UTF-8");
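
A rough sketch of that manual extraction, in plain Java so it runs outside a container. The class name, the getUtf8Parameter helper and the sample query string are all hypothetical; in a servlet the raw string would come from request.getQueryString(). It only covers query strings, not POST bodies.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryStringUtf8 {

    // Decodes one URL-encoded token as UTF-8.
    static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: returns the first value for `key` in a raw,
    // still URL-encoded query string, or null if the key is absent.
    static String getUtf8Parameter(String rawQuery, String key) {
        if (rawQuery == null) {
            return null;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String rawName = eq >= 0 ? pair.substring(0, eq) : pair;
            String rawValue = eq >= 0 ? pair.substring(eq + 1) : "";
            if (decodeUtf8(rawName).equals(key)) {
                return decodeUtf8(rawValue);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-in for request.getQueryString().
        String rawQuery = "lang=fr&key=caf%C3%A9+cr%C3%A8me";
        System.out.println(getUtf8Parameter(rawQuery, "key"));
    }
}
```

For the sample input the helper should return "café crème" (URLDecoder also turns "+" into a space). URLDecoder.decode throws IllegalArgumentException on malformed percent-escapes, so real code would probably want to catch that.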

If you want to pass this value to an XSLT transformation parameter using Xalan, you'll also run into a UTF-8-awareness limitation. You'll want to pass a UTF-8 URL-encoded version of the value to the transformation as $encodedValue. Then inside the XSLT stylesheet, declare a variable like this: <xsl:variable name="decodedValue" select="java:java.net.URLDecoder.decode($encodedValue, 'UTF-8')"/> ... assuming you've enabled "java" as an extension by declaring its namespace.

update: 09/26/2006 Upon reading this article on UTF-8 and request.getParameter from jGuru, a better approach appears to be:
if (request.getCharacterEncoding() == null) {
    request.setCharacterEncoding("UTF-8");
}
paramValue = request.getParameter("paramKey");
Basically, the servlet engine needs to be told to retrieve parameters using UTF-8, as browsers don't always send accurate information as to what encoding is being used in a form submission.

update: 12/05/2006: an anonymous poster below points us to this discussion about setCharacterEncoding having no effect.


Anonymous said...

I think you'll find that Tomcat, for one, ignores setCharacterEncoding as far as interpreting request parameters goes.

See for examples of the discussions.

Unknown said...


Explanation: your getBytes() call uses the platform encoding of the server. One would like to use the W3C default, ISO-8859-1. But ALL browsers interpret ISO-8859-1 as Cp1252 (even on Mac). Hence it may be better to use Windows-1252, aka Cp1252.
- Joop Eggen

Anonymous said...

Just read:
and port your application to UTF-8 like everyone else

Anonymous said...

Thanks. It runs!