History | Log In     View a printable version of the current page.  
Issue Details (XML | Word | Printable)

Key: WW-1495
Type: Bug Bug
Status: Resolved Resolved
Resolution: Fixed
Priority: Major Major
Assignee: tm_jee
Reporter: Brad Baker
Votes: 1
Watchers: 2
Operations

If you were logged in you would be able to see more operations.
WebWork

FastByteArrayOutputStream has an encoding bug

Created: 09/Jan/08 09:35 PM   Updated: 18/Jan/08 08:12 PM
Component/s: None
Affects Version/s: 1.4, 2.2, 2.1.7, 2.2.1, 2.2.2, 2.2.3, 2.2.4, 2.2.5, 2.2.6
Fix Version/s: 2.2.7

File Attachments: 1. GZip Archive FastByteArrayOutputStream-atlassian-patch.tar.gz (5 kb)
2. Java Source File FastByteArrayOutputStream.java (8 kb)


Flags: Important, Patch


 Description  « Hide

If multi byte character sets are used AND the output data is greater than 8K, then FastByteArrayOutputStream fails to respect the boundaries
of the characters.

It does this because it "encodes" the list of buffers as if they are mutually exclusive when in fact they arent when multi byte characters are in play.

This affects all web work tags derived from ComponentTagl, as they use FastByteArrayOutputStream to buffer tag output.

I have attached a patch for this problem. It uses a 2 rule heuristic.

If there is a single buffer, it encodes just that buffer via new String(bytes,encoding)

if there is multiple buffers, it smooshes them togther (via allocating and System.arraycopy) and then encodes them via new String(allbytes, encoding)

This is a bug in both webwork-1 (which we use in [Atlassian JIRA | www.atlassian.com/jira] and also in webwork-2.

As far as I can tell the scope of the problem is in derivations of ComponentTag (TextAreaTag, TextTag and so on)




 All   Comments   Change History      Sort Order:
Alexey Efimov - [10/Jan/08 04:57 AM ]
I'll see the patch you attached, and find that it does not used CharBuffer. In fact, this path will not work also good, as patch in original JIRA issue (http://jira.atlassian.com/browse/JRA-13866).

Please look attached file. The problem actually in follow:
For example, UTF-8 encoding can have symbols represented into 4 bytes. One character encoded by 4 bytes. And other some symbols encoded by 2 bytes. This fact make conversion from bytes to chars is more complicated that just spliting bytes by buffers per 8192 bytes each. for instance it can fill all 8190 byte by 2-byte character, and last 2 bytes filled by first 2 bytes from 4-bytes character. The one character is splited by two parts, and this is a make error, then you will convert from 8192 byte array to String, last character will not converted (ignored), and the first character in next buffer will encoded as wrong character.

I'm have very pure English, sorry. If you have some Russian speaking on your side, i can explain problem more complex.

Thanks!

Brad Baker - [10/Jan/08 05:58 PM ]

Alex you are correct in identifying the bug where the web work tag does not correctly respect character boundaries.

I took you code, converted it to not use generics (webwork does not support 1.5 language level) and tested it via the units test in the atlassian patch.


The problem with you patch is that its an order of magnitude slower than the previous code and also an order of magnitude slower than re-allocating and copying all the current bytes (via toByteArray()) and then converting them via new String(bytes,encoding)

I originally tried a fix that would use the following rules for encoding chars

 * if there is only 1 byte buffer, encoding it directly via new String(bytes,encoding)
 * if there is less then 3 byte buffers, copy them into one big byte array and then encode it via via new String(bytes,encoding)
 * if there is more then 3 then use you CharsetDecoder approach.

The good thing about this is that the a known amount of memory is used. In your "decoder" code you allocate an 8K ByteBuffer + a 16K CharBuffer (8K of chararacter and hence 16K of byte memory) so
there needs to be more than 3 8K buffers in the list to be equal to allocating one big 32K buffer and System.arraycopy() the bytes across.

This worked well but then I hit a snag. I tested FastByteArrayOutputStream a number of buffers sizes and a number of output sizes with a number of different encodings. I found an infinite loop bug in your code
when testing large output sizes (> 100K) and UTF-16. It was caused by your inner decodeAndWriteBuffered() always wanting to encode more data (reporting overflow) but the byte buffer never grows in this function if more
data is needed. It may have been because I used "odd" sized buffer sizes and I didn't have the time to completely debug it.

The upshot is that the most common use case of FastByteArrayOutpuStream is for output of less than 8K and I think 95% of cases would be less than 32K.

So its much faster to smoosh together via toByteArray() and encode it at once via new String(bytes,encoding). At most you need double the output byte size in memory terms but it runs extremely fast.

As you why new String(bytes,encoding) is an order of magnitude faster than CharsetDecoder and nio ByteBuffer/CharBuffer translation I am not quite sure. I think its because under the covers significant CPU is involved in copy small bits off memory around (especially since your code calls in.compact()). Also I think the String() code can assume one long buffer and hence work that much faster.

The upshot of all this is that submitted patch fixes the encoding problems (as demonstrated by the FastByteArrayOutputStreamTestCase) and its runs very fast when doing so.


Cheers

Brad Baker
JIRA Developer

this problem is also linked at http://jira.atlassian.com/browse/JRA-13866

Scott Farquhar - [16/Jan/08 04:54 AM ]
Note that this has been committed to the 1.4 webwork branch.

Note also that the patch was for the 1.4 branch.

tm_jee - [18/Jan/08 08:06 PM ]
Hi guys,

Thanks for reporting this. Brat's patch have been committed into webwork trunk and will be available in webwork 2.2.7 distribution (not release yet at the time of writing).

Brat and Alex, thanks for your effort, really appreciate it. All the best to Atlassian as well.

Cheers guys.

tm_jee - [18/Jan/08 08:12 PM ]
This was reported on 9th Jan and only get it into trunk at 18 Jan, what a delay... shame on me :-( ... Should try to give a faster response next time. Cheers guys.