|
|
|
Alex you are correct in identifying the bug where the web work tag does not correctly respect character boundaries. I took you code, converted it to not use generics (webwork does not support 1.5 language level) and tested it via the units test in the atlassian patch. The problem with you patch is that its an order of magnitude slower than the previous code and also an order of magnitude slower than re-allocating and copying all the current bytes (via toByteArray()) and then converting them via new String(bytes,encoding) I originally tried a fix that would use the following rules for encoding chars * if there is only 1 byte buffer, encoding it directly via new String(bytes,encoding) * if there is less then 3 byte buffers, copy them into one big byte array and then encode it via via new String(bytes,encoding) * if there is more then 3 then use you CharsetDecoder approach. The good thing about this is that the a known amount of memory is used. In your "decoder" code you allocate an 8K ByteBuffer + a 16K CharBuffer (8K of chararacter and hence 16K of byte memory) so there needs to be more than 3 8K buffers in the list to be equal to allocating one big 32K buffer and System.arraycopy() the bytes across. This worked well but then I hit a snag. I tested FastByteArrayOutputStream a number of buffers sizes and a number of output sizes with a number of different encodings. I found an infinite loop bug in your code when testing large output sizes (> 100K) and UTF-16. It was caused by your inner decodeAndWriteBuffered() always wanting to encode more data (reporting overflow) but the byte buffer never grows in this function if more data is needed. It may have been because I used "odd" sized buffer sizes and I didn't have the time to completely debug it. The upshot is that the most common use case of FastByteArrayOutpuStream is for output of less than 8K and I think 95% of cases would be less than 32K. So its much faster to smoosh together via toByteArray() and encode it at once via new String(bytes,encoding). At most you need double the output byte size in memory terms but it runs extremely fast. As you why new String(bytes,encoding) is an order of magnitude faster than CharsetDecoder and nio ByteBuffer/CharBuffer translation I am not quite sure. I think its because under the covers significant CPU is involved in copy small bits off memory around (especially since your code calls in.compact()). Also I think the String() code can assume one long buffer and hence work that much faster. The upshot of all this is that submitted patch fixes the encoding problems (as demonstrated by the FastByteArrayOutputStreamTestCase) and its runs very fast when doing so. Cheers Brad Baker JIRA Developer this problem is also linked at http://jira.atlassian.com/browse/JRA-13866 Note that this has been committed to the 1.4 webwork branch.
Note also that the patch was for the 1.4 branch. Hi guys,
Thanks for reporting this. Brat's patch have been committed into webwork trunk and will be available in webwork 2.2.7 distribution (not release yet at the time of writing). Brat and Alex, thanks for your effort, really appreciate it. All the best to Atlassian as well. Cheers guys. | ||||||||||||||||||||||||||||||||||||||||||||||||
Please look attached file. The problem actually in follow:
For example, UTF-8 encoding can have symbols represented into 4 bytes. One character encoded by 4 bytes. And other some symbols encoded by 2 bytes. This fact make conversion from bytes to chars is more complicated that just spliting bytes by buffers per 8192 bytes each. for instance it can fill all 8190 byte by 2-byte character, and last 2 bytes filled by first 2 bytes from 4-bytes character. The one character is splited by two parts, and this is a make error, then you will convert from 8192 byte array to String, last character will not converted (ignored), and the first character in next buffer will encoded as wrong character.
I'm have very pure English, sorry. If you have some Russian speaking on your side, i can explain problem more complex.
Thanks!