
Key: CACHE-49
Type: New Feature
Status: Open
Priority: Major
Assignee: Lars Torunski
Reporter: Paul Rivers
Votes: 5
Watchers: 7
OSCache

Add Http1.1 Compression (GZip) and increase efficiency

Created: 04/Aug/03 03:46 PM   Updated: 22/Oct/06 10:36 AM
Component/s: Filters
Affects Version/s: None
Fix Version/s: None

File Attachments: 1. Java Source File CacheFilter.java (10 kB)
2. Java Source File CacheFilter.java (10 kB)
3. Java Source File CacheHttpServletResponseWrapper.java (8 kB)
4. Text File GZIP.patch (7 kB)
5. Java Source File ResponseContent.java (5 kB)

Description
By adding support for compressing web pages using gzip to OSCache, you can:
1. Decrease the memory used by OSCache to cache web pages.
2. In situations where the cache is infrequently invalidated, reduce the amount of work the webserver does - smaller files mean the webserver sends less data, which takes less work.
3. Of course, decreased load time for web pages for the end user and less use of bandwidth.

As you may know, all http1.1 compliant browsers must be capable of reading content compressed using gzip (At least that's my understanding - it's mentioned at http://www.w3.org/Protocols/rfc2068/rfc2068) This seems to reduce the size of most webpages to about 33% of their original size. For example, the OSCache homepage compresses to 33.5% of its original size, going from an estimated download time of 2.2 seconds to 0.7 seconds. The java.sun.com homepage can be reduced to 14.8% of its original size if it was gzip compressed (Source: http://leknor.com/code/gziped.php). Mozilla also did their own compression testing: http://www.mozilla.org/projects/apache/gzip/

The problem is that using gzip compression normally increases the load on the server - each time a page is requested, it must be gzipped. With static html pages, it is possible to cache the compressed page, so it isn't compressed over and over again. Obviously, that won't work with dynamic content - but since anyone using OSCache IS managing a cache of their pages, this is the perfect place for it! It also doesn't require a lot of extra coding - the jdk already has classes that perform gzip compression for you. In fact, tomcat ships with code to do compression on a web page as example code (Tomcat 4.1\webapps\examples\WEB-INF\classes\compressionFilters).
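The idea above (compress a page once when it enters the cache, then serve the stored bytes on every hit) could be sketched roughly as follows. This is only an illustration using the JDK's java.util.zip classes; the class name CompressedPageCache and its put/get methods are hypothetical and are not part of OSCache.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: a cache that stores pages in gzip form only.
public class CompressedPageCache {
    private final ConcurrentMap<String, byte[]> entries = new ConcurrentHashMap<>();

    // Compress a byte array with the JDK's gzip implementation.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // The compression cost is paid once, when the page is cached.
    void put(String key, String html) {
        entries.put(key, gzip(html.getBytes(StandardCharsets.UTF_8)));
    }

    // Cache hits return the already compressed bytes; nothing is recompressed.
    byte[] get(String key) {
        return entries.get(key);
    }
}
```

With this layout the memory held per entry is the compressed size, which is where the memory saving mentioned in point 1 would come from.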

Of course, only certain types of content benefit from and should be compressed - html, htm, xml are good, jpg and gif shouldn't be compressed.

I think enabling gzip should be configurable by the user - use gzip for requests with certain content types (best solution), file extensions, url's matching a regular expression, or pick and choose individual files.

The only major problem I can think of is - What should OSCache do when a request comes in that doesn't support gzip (you can check the headers for a web page request to determine if it supports gzip compression)? I know of 3 possibilities:
1. Send an error - who uses an http1.0 browser anyways? ;-)
2. Decompress the page on the server and send it - http1.0 requests are the exception. This seems like the most sensible default behavior.
3. Keep a cache of both the compressed and uncompressed pages. I think this should be available as an option, but not the default. You would actually use more memory this way.
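As a rough illustration of option 2, the cache could hold only the gzip bytes and inflate them on the fly for the rare client that does not accept gzip. The sketch below assumes plain byte arrays rather than servlet responses; the class GzipServeSketch and its methods are hypothetical names, not OSCache code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of option 2: one compressed copy per page.
public class GzipServeSketch {

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Gzip-capable clients get the cached bytes untouched;
    // everyone else pays a decompression on each request.
    static byte[] bodyFor(byte[] cachedGzip, boolean clientAcceptsGzip) {
        return clientAcceptsGzip ? cachedGzip : gunzip(cachedGzip);
    }

    // Value for the Content-Encoding response header, or null if none should be sent.
    static String contentEncodingFor(boolean clientAcceptsGzip) {
        return clientAcceptsGzip ? "gzip" : null;
    }
}
```

Option 3 would instead store both the raw and the compressed bytes per entry, trading memory for the CPU spent in gunzip.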

Chris Miller added a comment - 04/Aug/03 04:36 PM
Nice idea, the caching and compression make for a pretty compelling combination.

Of course you'd gain most of these benefits simply by adding one of the many freely available GZIP filters after the OSCache filter, although that admittedly causes problems with http 1.0 clients (or I think more correctly, clients that don't have "gzip" in their Accept-Encoding header).

Initially my reaction was that this functionality was out of scope, but given the problems surrounding clients that can't handle gzip, plus the fact that OSCache is all about performance anyway, this sounds like a pretty useful addition. Hopefully we can add it shortly after 2.0 is released.

Paul Rivers added a comment - 07/Aug/03 12:03 AM
What's great about it is that it doesn't just add compression, it will reduce the memory overhead of OSCache (among other benefits), thus it benefits OSCache to include it. :-)

Paul Rivers added a comment - 14/Aug/03 02:52 PM
"XCache 2.2 ($4500 direct)...Other appealing features include granular administration features over what gets cached, and Gzip, which compresses white spaces in pages for greater performance over low bandwidths"

http://www.pcmag.com/print_article/0,3048,a=46124,00.asp

:-) Looking forward to 2.1!

Lars Torunski added a comment - 28/Mar/04 10:43 AM
Option (3) "Keep a cache of both the compressed and uncompressed pages" should be the default caching behavior.

HTTP 1.1 browsers are in common use, so HTTP 1.0 requests are rare and the extra memory usage is small.

Furthermore the caching algorithm will throw away "http 1.0 cache objects" if requests without gzip compression are less frequent than gzip enabled requests.

(see http://www.faqs.org/rfcs/rfc1952.html)

Lars Torunski added a comment - 23/Jan/05 11:59 PM
1.)

ONJava.com: http://www.onjava.com/pub/a/onjava/2003/11/19/filters.html


2.)

Documented examples of browsers that claim to handle gzip but can't:

# Netscape 4.x has some problems...
BrowserMatch ^Mozilla/4 gzip-only-text/html
# Netscape 4.06-4.08 have some more problems
BrowserMatch ^Mozilla/4\.0[678] no-gzip
# MSIE masquerades as Netscape, but it is fine
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html

At first we probe for a User-Agent string that indicates a Netscape Navigator version of 4.x. These versions cannot handle compression of types other than text/html. The versions 4.06, 4.07 and 4.08 also have problems with decompressing html files. Thus, we completely turn off the deflate filter for them.

The third BrowserMatch directive fixes the guessed identity of the user agent, because the Microsoft Internet Explorer identifies itself also as "Mozilla/4" but is actually able to handle requested compression. Therefore we match against the additional string "MSIE" (\b means "word boundary") in the User-Agent Header and turn off the restrictions defined before.

Taken from http://httpd.apache.org/docs-2.0/mod/mod_deflate.html
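For a Java filter, the three BrowserMatch rules quoted above might translate into something like the following. This is a sketch only: the class BrowserGzipRules, the Policy enum and the method names are made up for illustration, and the regular expressions are copied from the Apache example rather than tested against real browsers.

```java
import java.util.regex.Pattern;

// Hypothetical translation of the mod_deflate BrowserMatch rules into Java.
public class BrowserGzipRules {

    // What may be compressed for a given client (illustrative names).
    enum Policy { GZIP_ALL, GZIP_HTML_ONLY, NO_GZIP }

    private static final Pattern NETSCAPE_4 = Pattern.compile("^Mozilla/4");
    private static final Pattern NETSCAPE_406_TO_408 = Pattern.compile("^Mozilla/4\\.0[678]");
    private static final Pattern MSIE = Pattern.compile("\\bMSIE");

    static Policy policyFor(String userAgent) {
        if (userAgent == null) {
            return Policy.GZIP_ALL; // no header: none of the rules match
        }
        if (MSIE.matcher(userAgent).find()) {
            return Policy.GZIP_ALL; // MSIE only masquerades as Mozilla/4
        }
        if (NETSCAPE_406_TO_408.matcher(userAgent).find()) {
            return Policy.NO_GZIP; // Netscape 4.06-4.08
        }
        if (NETSCAPE_4.matcher(userAgent).find()) {
            return Policy.GZIP_HTML_ONLY; // other Netscape 4.x
        }
        return Policy.GZIP_ALL;
    }

    // Combines the browser policy with the response content type.
    static boolean mayGzip(String userAgent, String contentType) {
        switch (policyFor(userAgent)) {
            case NO_GZIP:
                return false;
            case GZIP_HTML_ONLY:
                return contentType != null && contentType.startsWith("text/html");
            default:
                return true;
        }
    }
}
```

The MSIE check comes first because the Apache rule for MSIE undoes both restrictions set by the earlier rules.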

Michael Greer added a comment - 28/Feb/05 12:55 AM
I hope this is configurable, since I have run into many issues with a bug in some versions of IE's gzip compression. It is fixed, but there are folks out there with it. What happens is that while it decompresses web pages fine, it _fails_ to decompress content for plugins. If you are serving data to a Flash app, the app crashes or hangs.
Many sites with dynamic Flash disable gzip completely, since it is impossible to detect this specific version from the headers.
Anyway, nice work, and please make it optional.

Fernando Martins added a comment - 20/Mar/05 05:36 PM
I needed this functionality for a project, so I implemented a simple version which allows a gzip filter (not included) to compress the pages.
I will attach the files, they are modified versions of CVS HEAD as of 20/Mar.

Changes are as follows:

1) ResponseContent holds a contentEncoding which is set by CacheHttpServletResponseWrapper when such a header is set or added.
2) CacheFilter will check (when serving from cache) if the client accepts gzip encoding, and pass this as a boolean to ResponseContent.writeTo(...)
3) ResponseContent.writeTo(...) will check if the contentEncoding it holds is set to gzip AND if the client supports gzip (passed as an argument from CacheFilter); in this case the header for gzip encoding is added to the response.
If the client doesn't support GZIP the content is uncompressed and sent, thus going for option 2 from the original Issue description, since this will save memory and is the most appropriate approach for my usecase.

Feel free to use and improve the code as you please.


Fernando Martins added a comment - 20/Mar/05 05:38 PM
It would be great if this gets released a little earlier than 2.3

Lars Torunski added a comment - 21/Mar/05 12:21 AM
Fernando,
1. in CacheFilter you catch all exceptions, hence they won't be propagated outside the CacheFilter now. This changes the behavior of the filter. Why do you catch all exceptions?
2. the gzip compression should be enabled by a filter parameter

Lars Torunski added a comment - 21/Mar/05 01:23 AM
Furthermore the check

            String acceptEncoding = ((HttpServletRequest) request).getHeader("Accept-Encoding");

            if ((acceptEncoding != null) && (acceptEncoding.indexOf("gzip") != -1)) {
                supportGzip = true;
            }

should be refactored into a new method "isGZipSupported(HttpServletRequest request)", so that the browser and the requested content can be checked as well.

Fernando, can you provide a patch for that?
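A possible shape for such a helper is sketched below. To stay independent of the servlet API it takes the raw Accept-Encoding header value instead of an HttpServletRequest, and it goes slightly beyond the indexOf check by honouring an explicit "gzip;q=0". The class name AcceptEncodingSketch is hypothetical, and wildcard ("*") entries are deliberately ignored in this sketch.

```java
// Hypothetical helper: decides from the Accept-Encoding header value
// whether a gzip-encoded response may be sent.
public class AcceptEncodingSketch {

    static boolean isGzipSupported(String acceptEncoding) {
        if (acceptEncoding == null) {
            return false;
        }
        for (String part : acceptEncoding.split(",")) {
            String[] tokens = part.trim().split(";");
            String coding = tokens[0].trim();
            if (!coding.equalsIgnoreCase("gzip") && !coding.equalsIgnoreCase("x-gzip")) {
                continue; // some other coding, e.g. deflate
            }
            double q = 1.0; // default quality when no q parameter is given
            for (int i = 1; i < tokens.length; i++) {
                String param = tokens[i].trim();
                if (param.regionMatches(true, 0, "q=", 0, 2)) {
                    try {
                        q = Double.parseDouble(param.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat a malformed q-value as "not acceptable"
                    }
                }
            }
            return q > 0.0; // the first gzip entry decides
        }
        return false;
    }
}
```

In a filter, a method like isGZipSupported(HttpServletRequest request) could pass request.getHeader("Accept-Encoding") to this helper and then apply further checks, for example on the User-Agent.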

Fernando Martins added a comment - 21/Mar/05 04:57 AM
Lars,

1. Good point, I forgot to remove it.
The reason I added it there was that I was noticing the following behaviour:
some generated content (images) was not being cached by the filter, and I was not getting any exception, so I had no idea why it was not getting cached, even though the images were being generated correctly.
So after having put a log.warn(e) there, I found out that I was (incorrectly) closing the response output stream in my servlet (after outputting the generated image), and since ResponseContent also does a close() on the outputStream, an IOException was being thrown complaining about this.
I'm not sure whether that kind of exception should be logged somewhere else (maybe it depends on the container) and the CacheFilter should not care about it and just transparently propagate the exceptions, or whether it should log them and propagate them further, something like:

            } catch (ServletException e) {
                log.info("<cache>: Did not cache entry with key "+key+" reason:"+e.getMessage());
                throw e;
            } catch (IOException e) {
                log.info("<cache>: Did not cache entry with key "+key+" reason:"+e.getMessage());
                throw e;
            }

What's your opinion?

2) It's not necessary to have a gzip filter parameter, since CacheFilter is not doing any compression it could be somewhat confusing.
The only thing it is doing is to check if the client browser supports it, and actually this test is redundant if no gzip filter is being used, so I just changed the code so that the checking for gzip support is done only in the case when the content is actually compressed.
So if you're not using a GZIP filter, the check for GZIP support will not be done.

3) I added a protected method isGZipSupported(HttpServletRequest request), so one can subclass the filter and add additional checks for browser agent, etc.

will attach the patch

Lars Torunski added a comment - 21/Mar/05 01:21 PM
1. I'm wondering why this wasn't done before. Maybe there is a reason for it.

2. We have to check if the page is a fragment of a page, so please change

            if (respContent.isContentGziped()) {
                supportGZIP = isGZIPSupported((HttpServletRequest)request);
            }

to

            supportGZIP = respContent.isContentGziped() && isGZIPSupported((HttpServletRequest)request) && !fragmentRequest;

or move up your lines to the SC_NOT_MODIFIED section.

Furthermore we are looking for a full gzip compression implementation, hence we need a gzip filter parameter, because e.g. the user should be able to decide that delivered PDFs don't need any further compression.
Can you implement the rest of the gzip compression?

3. OK

Can you please provide the CacheFilter instead of the patch file? Thanks.

Fernando Martins added a comment - 21/Mar/05 01:47 PM
I will post a new CacheFilter with the fragmentRequest check.
Do you want full new versions of the other 2 files, or can you use the patch for that?

About the gzip filter:
I think this is a separate matter.
So if you want in the future to provide a Gzip filter in oscache with functionality to exclude pdf and other stuff, great, but that doesn't belong in the CacheFilter itself.
There are free gzip filters around that one can use, independently of any caching.
But nevertheless, it should work without trouble if I use it with my own private gzip filter.
My changes are just to allow gzip filters to work correctly with cache filter.
So independently if oscache will provide out of the box a gzip filter or not, I believe these changes are necessary either way.
I can later on submit a gzip filter for oscache, but I think the first step is to have the cache filter be aware of the potential existence of gzip filters down the filter chain, and behave appropriately in those cases.




Fernando Martins added a comment - 21/Mar/05 01:48 PM
version 2

Lars Torunski added a comment - 21/Mar/05 04:33 PM
The sources from Fernando Martins were moved to CACHE-155 (Support of GZip filters in the filter chain), because this issue contains a full gzip compression support.

Lars Torunski added a comment - 21/Mar/05 04:36 PM
Fernando, I modified your sources and added them to the CVS head. I changed some names and if statements, so please take a look at the new sources and approve them. Thanks Lars

Lars Torunski added a comment - 03/Jul/05 01:58 AM
This issue should be changed to "Support GZIP compression filter libraries"

E.g.:
- http://sourceforge.net/projects/pjl-comp-filter
- http://opensource.atlassian.com/projects/spring/browse/SPR-787

The CacheFilter class can support different filters without implementing the GZIP stuff

Simone Avogadro added a comment - 06/Jul/05 08:54 AM
We are currently working on this too, but we are taking the following approach:
- GZip filter & Cache filter should be independent, we will use Caucho's Resin (www.caucho.com) own GZip filter, this way we delegate the broken-browser-support
- the "Accept-Encoding" header will be part of the cache's entry key
- the cache entry will hold the Response-Encoding value

let me know if I can submit a patch for this and from which sources may I start

Lars Torunski added a comment - 06/Jul/05 04:31 PM
We should implement this issue after the changes of the 2.2. release proved to be stable. Andres can provide a special branch for your changes.

Do you want to cache two responses for the same page, when one request accepts gzip encoding and the 2nd request doesn't support compressing?

Simone Avogadro added a comment - 07/Jul/05 05:36 AM
Yes, basically I would want to be able to cache separate responses based upon the following headers:
* Accept-Encoding
* Accept-Charset
* Accept-Language
* Accept

this way we would also resolve the annoying problem discussed when:
Client 1: GET /go/Home, Accept-Charset: UTF-8 => cache /go/Home with response charset UTF-8
Client 2: GET /go/Home, Accept-Charset: ISO-8859-1 => response from cache is UTF-8, browser gets confused :-(

basically I would use all the Accept-* headers to form the key to cache, this is obviously redundant, but ensures response consistency
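A minimal sketch of such a key, assuming the request headers have already been collected into a map keyed by their canonical names (the class VariantKeySketch and the key layout are illustrative only, not what OSCache does):

```java
import java.util.Map;

// Hypothetical sketch: build a cache key that varies with the Accept-* headers.
public class VariantKeySketch {

    private static final String[] VARY_HEADERS = {
        "Accept-Encoding", "Accept-Charset", "Accept-Language", "Accept"
    };

    // headers: request header name -> value; missing headers count as empty.
    static String buildKey(String uri, Map<String, String> headers) {
        StringBuilder key = new StringBuilder(uri);
        for (String name : VARY_HEADERS) {
            String value = headers.get(name);
            key.append('|').append(name).append('=')
               .append(value == null ? "" : value.trim());
        }
        return key.toString();
    }
}
```

Two requests for /go/Home with different Accept-Charset values then map to different entries, at the cost of storing one copy per distinct header combination.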

as to getting a special branch we can live with this and make this branch separately in house, but I feel the need to limit the number of differences between our current internal branches (with the patches I proposed in various points) and the official one.
is there any schedule for 2.2?

Lars Torunski added a comment - 07/Jul/05 02:28 PM
Simone, your problem should be fixed with CACHE-38 in the 2.1.1 release.

My focus would be to avoid caching different responses for the same page. If a user accesses a cached object, I would try to convert it to a different charset or to compress it etc. Afterwards I would cache this converted response. The advantage is that every user gets the same content based on his accepted charset, encoding etc.

We haven't any schedule for 2.2 yet. Andres has to fix his assigned issues first.

Simone Avogadro added a comment - 08/Jul/05 02:31 AM
I will try 2.1.1 ASAP, however we are willing to keep multiple cached versions of the objects
this is our conclusion after taking into account different problems and different solutions, which led to the multiple-entry approach as the only viable solution
here are the three main reasons:
- we are currently using Resin's GZipFilter in front of OS-Cache and this degrades performance a lot under heavy load. On the same machine (Athlon64 3GHz) we can serve about 4000 simultaneous users _without_ gzip, but can barely handle 1000 with GZip. This is because GZip is CPU intensive and saturates the CPU; on the other hand, without GZip we are simply serving content from memory, which is basically near-0 cost
- after a lot of work on char encodings I came to believe that on-the-fly conversion is both CPU-expensive and too complex: this is a matter to be left to application servers. So basically doing on-the-fly trans-encoding would have the following problems:
    - CPU load
    - lost content: incomplete encodings like WINDOWS-1521 or ISO-8859-1 will lose some content, if the response is cached using one of these encodings then it will be impossible to recover the lost content
- language: sometimes the server will reply differently depending on the preferred language of the client, if we ignore this behaviour we lose some of the webserver's functionality

Lars Torunski added a comment - 22/Oct/06 10:36 AM
http://sourceforge.net/projects/pjl-comp-filter/
Version 1.4.5, released on 2006-10-15 is a backport of 1.6.2 to JDK 1.4

Does anybody have positive experiences with this filter?

CACHE-155 was implemented in release 2.2 and if this filter works fine, I will document the configuration of this filter and close this issue as "Won't Fix".