Changing dynamic to static URLs

Search engine-friendly links with mod_rewrite

Introduction

One of the most frequent questions posted in the Apache Server forum is “How can I change my dynamic URLs to static URLs using mod_rewrite?” So this post is intended to answer that question and to clear up a very common misconception.

Mod_rewrite cannot “change” the URLs on your pages

First, the misconception: Mod_rewrite cannot be used to change the URL that the visitor sees in his/her browser address bar unless an external redirect is invoked. But an external redirect would ‘expose’ the underlying dynamic URL to search engines and would therefore completely defeat the purpose here. This application calls for an internal server rewrite, not an external client redirect.

It’s also important to realize that mod_rewrite works on requested URLs after the HTTP request is received by the server, and before any scripts are executed or any content is served. That is, mod_rewrite changes the server filepath and script variables associated with a requested URL, but has no effect whatsoever on the content of ‘pages’ output by the server.

How to change dynamic to static URLs

With that in mind, here’s the procedure to implement search engine-friendly static URLs on a dynamic site:

  • Change all URLs in links on all pages to a static form. This is usually done by modifying the database or by changing the script that generates those pages. PHP’s preg_replace function often comes in handy for this.
  • Add mod_rewrite code to your httpd.conf, conf.d, or .htaccess file to internally rewrite those static URLs, when requested from your server, into the dynamic form needed to invoke your page-generation script.
  • Add additional mod_rewrite code to detect direct client requests for dynamic URLs and externally redirect those requests to the equivalent new static URLs. A 301-Moved Permanently redirect is used to tell search engines to drop your old dynamic URLs and use the new static ones, and also to redirect visitors who may come back to your site using outdated dynamic-URL bookmarks.

    Considering the above for a moment, one quickly realizes that both the dynamic and static URL formats must contain all the information needed to reconstruct the other format. In addition, careful selection of the ‘design’ of the static URLs can save a lot of trouble later, and also save a lot of CPU cycles which might otherwise be wasted with an inefficient implementation.

    An earnest warning

    It is not my purpose here to explain all about regular expressions and mod_rewrite; The Apache mod_rewrite documentation and many other tutorials are readily available on-line to anyone who searches for them (see also the references cited in the Apache Forum Charter and the tutorials in the Apache forum section of the WebmasterWorld Library).

    Trying to use mod_rewrite without studying that documentation thoroughly is an invitation to disaster. Keep in mind that mod_rewrite affects your server configuration, and that one single typo or logic error can make your site inaccessible or quickly ruin your search engine rankings. If you depend on your site’s revenue for your livelihood, intense study is indicated.

    That said, here’s an example which should be useful for study, and might serve as a base from which you can customize your own solution.

    Working example

    Old dynamic URL format: /index.php?product=widget&color=blue&size=small&texture=fuzzy&maker=widgetco

    New static URL format: /product/widget/blue/small/fuzzy/widgetco

    Mod_rewrite code for use in .htaccess file:

    # Enable mod_rewrite, start rewrite engine
    Options +FollowSymLinks
    RewriteEngine on
    #
    # Internally rewrite search engine friendly static URL to dynamic filepath and query
    RewriteRule ^product/([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/?$ /index.php?product=$1&color=$2&size=$3&texture=$4&maker=$5 [L]
    #
    # Externally redirect client requests for old dynamic URLs to equivalent new static URLs
    RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\?product=([^&]+)&color=([^&]+)&size=([^&]+)&texture=([^&]+)&maker=([^\ ]+)\ HTTP/
    RewriteRule ^index\.php$ http://example.com/product/%1/%2/%3/%4/%5? [R=301,L]

    Note that the keyword “product” always appears in both the static and dynamic forms. This is intended to make it simple for mod_rewrite to detect requests where the above rules need to be applied. Other methods, such as testing for file-exists, are also possible, but are less efficient and more prone to errors than this approach.
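Because each URL format must carry everything needed to rebuild the other, the two rules can be mirrored outside Apache as a quick sanity check before touching the server. The following is only a minimal sketch in Java, not part of the mod_rewrite setup: the class and method names are hypothetical, and the dynamic pattern is simplified to match a bare path-plus-query string rather than the full THE_REQUEST line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the two rewrite rules for offline checking.
final class UrlFormatSketch {
    // Same shape as the RewriteRule pattern, but with a leading slash.
    private static final Pattern STATIC_FORM = Pattern.compile(
        "^/product/([^/]+)/([^/]+)/([^/]+)/([^/]+)/([^/]+)/?$");

    // Simplified from the RewriteCond pattern: path and query string only.
    private static final Pattern DYNAMIC_FORM = Pattern.compile(
        "^/index\\.php\\?product=([^&]+)&color=([^&]+)&size=([^&]+)"
        + "&texture=([^&]+)&maker=([^&]+)$");

    // Static to dynamic, as the internal rewrite does. Returns null on no match.
    static String toDynamic(String url) {
        Matcher m = STATIC_FORM.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return "/index.php?product=" + m.group(1) + "&color=" + m.group(2)
            + "&size=" + m.group(3) + "&texture=" + m.group(4)
            + "&maker=" + m.group(5);
    }

    // Dynamic to static, as the 301 redirect does. Returns null on no match.
    static String toStatic(String url) {
        Matcher m = DYNAMIC_FORM.matcher(url);
        if (!m.matches()) {
            return null;
        }
        return "/product/" + m.group(1) + "/" + m.group(2) + "/" + m.group(3)
            + "/" + m.group(4) + "/" + m.group(5);
    }

    public static void main(String[] args) {
        String staticUrl = "/product/widget/blue/small/fuzzy/widgetco";
        String dynamicUrl = toDynamic(staticUrl);
        System.out.println(dynamicUrl);
        System.out.println(toStatic(dynamicUrl)); // round trip back to staticUrl
    }
}
```

If a URL from your own site does not survive the round trip, the static format is probably missing information that the dynamic format needs.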

    Differences between .htaccess code and httpd.conf or conf.d code

    If you wish to use this code at the server level or in a <VirtualHost> container in the httpd.conf or conf.d server configuration files, you will need to add a leading slash to the patterns in both RewriteRules, i.e. change “RewriteRule ^index\.php$” to “RewriteRule ^/index\.php$”. Also remember that you will need to restart your server before changes in these server config files take effect.

    How this works

  • A visitor uses their browser to view one of your pages
  • The visitor clicks on the link <a href="/product/gizmo/red/tiny/furry/gizmocorp">Tiny red furry gizmos by GizmoCorp!</a> on your page
  • The browser requests the virtual file http://example.com/product/gizmo/red/tiny/furry/gizmocorp from your server
  • Mod_rewrite is invoked, and the first rule above rewrites the request to /index.php?product=gizmo&color=red&size=tiny&texture=furry&maker=gizmocorp, invoking your script
  • Your script generates the requested page, and the server sends it back to the client browser
  • The visitor clicks on another link, and the process repeats

    Now let’s say a search engine spider visits your site using the old dynamic URL:

  • The spider requests http://example.com/index.php?product=wodget&color=green&size=large&texture=smooth&maker=wodgetsinc from your server
  • Mod_rewrite is invoked, and the second rule generates an external 301 redirect, informing the spider that the requested page has been permanently moved to http://example.com/product/wodget/green/large/smooth/wodgetsinc
  • The spider queues a request to its URL database manager, telling it to replace the old dynamic URL with the new one given in that redirect response.
  • The spider re-requests the page it was looking for using the new static URL http://example.com/product/wodget/green/large/smooth/wodgetsinc
  • Mod_rewrite is invoked, and the first rule internally rewrites the request to /index.php?product=wodget&color=green&size=large&texture=smooth&maker=wodgetsinc, invoking your script
  • Your script generates the requested page, and the server sends it back to the search engine spider for parsing and inclusion in the search index
  • Since the spider is now collecting pages including new static links, and all requests for old dynamic URLs are permanently redirected to the new static URLs, the new URLs will replace the old ones in search results over time.

    Location, location, location

    In order for the code above to work, it must be placed in the .htaccess file in the same directory as the /index.php file. Or it must be placed in a <directory> container in httpd.conf or conf.d that refers to that directory. Alternatively, the code can be modified for placement in any Web-accessible directory above the /index.php directory by changing the URL-paths used in the regular-expressions patterns for RewriteCond and RewriteRule.

    Regular-expressions patterns

    Just one comment on the regular expressions subpatterns used in the code above. I have avoided using the very easy, very popular, and very inefficient construct “(.*)/(.*)” in the code. That’s because multiple “.*” subpatterns in a regular-expressions pattern are highly ambiguous and highly inefficient.

    The reason for this is twofold; First, “.*” means “match any number of any characters”. And second, “.*” is ‘greedy,’ meaning it will match as many characters as possible. So what happens with a pattern like “(.*)/(.*)” is that multiple matching attempts must be made before the requested URL can match the pattern or be rejected, with the number of attempts equal to (the number of characters between “/” and the end of the requested URL plus two) multiplied by (the number of “(.*)” subpatterns minus one) — It is easy to make a multiple-“(.*)” pattern that requires dozens or even hundreds of passes to match or reject a particular requested URL.

    Let’s take a short example, bearing in mind that back-reference $1 contains the characters matched into the first parenthesized sub-pattern, while $2 contains those matched into the second sub-pattern:

    Requested URL: http://example.com/abc/def
    Local URL-path: abc/def
    Rule pattern: ^(.*)/(.*)$

    Pass# | $1 value | $2 value | Result
    1     | abc/def  | –        | no match
    2     | abc/de   | f        | no match
    3     | abc/d    | ef       | no match
    4     | abc/     | def      | no match
    5     | abc      | def      | Match

    I’ll hazard a guess that many many sites are driven to unnecessary server upgrades every year by this one error alone.

    Instead, I used the unambiguous constructs “([^/]+)”, “([^&]+)”, and “([^\ ]+)”. Roughly translated, these mean “match one or more characters not equal to a slash,” “match one or more characters not equal to an ampersand,” and “match one or more characters not equal to a space,” respectively. The effect is that each of those subpatterns will ‘consume’ one or more characters from the requested URL, up to the next occurrence of the excluded character, thereby allowing the regex parser to match the requested URL to the pattern in one single left-to-right pass.
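The ambiguity of “(.*)” is easy to see by comparing what the two styles capture. The sketch below uses Java's regex engine rather than Apache's, so it illustrates only the greedy-matching semantics and not mod_rewrite's internals; the class name and the test paths are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubpatternSketch {
    // Returns "$1|$2" when the pattern matches the whole path, else "no match".
    static String captures(String regex, String path) {
        Matcher m = Pattern.compile(regex).matcher(path);
        return m.matches() ? m.group(1) + "|" + m.group(2) : "no match";
    }

    public static void main(String[] args) {
        String greedy = "^(.*)/(.*)$";
        String negated = "^([^/]+)/([^/]+)$";

        // With a single slash, both patterns capture the same pieces.
        System.out.println(captures(greedy, "abc/def"));   // abc|def
        System.out.println(captures(negated, "abc/def"));  // abc|def

        // With an extra slash, the greedy form silently swallows it into $1,
        // while the negated-class form rejects the unexpected URL outright.
        System.out.println(captures(greedy, "a/b/c"));     // a/b|c
        System.out.println(captures(negated, "a/b/c"));    // no match
    }
}
```

The negated-class form therefore also acts as a cheap validity check: a URL with the wrong number of path segments never reaches your script.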

    Common problems

    A common problem encountered when implementing static-to-dynamic URL rewrites is that relative links to images, included CSS files, and external JavaScript files on your pages become broken. The key is to remember that it is the client (e.g. the browser) that resolves relative links. For example, if you are rewriting the URL /product/widget/blue/small/fuzzy/widgetco to your script, the browser will see a page called “widgetco”, and will treat a relative link on that page as being relative to the ‘virtual’ directory /product/widget/blue/small/fuzzy/. The two easiest solutions are to use server-relative or absolute (canonical) links, or to add additional code to rewrite image, CSS, and external JS URLs to the correct location. An example would be to use the server-relative link <img src="/logo.gif"> to replace the page-relative link <img src="logo.gif">.
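To see what the browser does with such links, the resolution step can be reproduced with the standard java.net.URI class. This is a small illustration only; the class name is hypothetical and the page URL is the static URL from the working example.

```java
import java.net.URI;

final class RelativeLinkSketch {
    // Resolves a link found on a page the way a client would.
    static String resolve(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/product/widget/blue/small/fuzzy/widgetco";

        // Page-relative link: resolved against the 'virtual' directory.
        System.out.println(resolve(page, "logo.gif"));
        // http://example.com/product/widget/blue/small/fuzzy/logo.gif

        // Server-relative link: resolved against the site root.
        System.out.println(resolve(page, "/logo.gif"));
        // http://example.com/logo.gif
    }
}
```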

    Avoiding testing problems

    For both .htaccess and server config file code, remember to flush your browser cache before testing any changes; Otherwise, your browser will likely serve any previously-requested pages from its cache instead of fetching them from your server. Obviously, in that case, no code on your server can have any effect on the transaction.

    Read first, then write and test

    I hope this post is helpful. If you still have problems after studying the mod_rewrite documentation and regular expressions tutorials, and writing and testing your own code, feel free to post relevant entries from your server error log and ask specific questions in the Apache Server forum. Please take a few minutes to read the WebmasterWorld Terms of Service and the Apache Forum Charter before posting (Thanks!).

  • JVM Tuning

    Better performance in production servers is possible with proper configuration of JVM parameters, particularly those related to memory usage and garbage collection.

    1. Heap size
      1. Heap size does not determine the amount of memory your process uses
    2. Garbage collection
    3. Stack size
    4. Monitoring the JVM
    Heap size

    The allocation of memory for the JVM is specified using -X options when starting Resin (the exact options may depend upon the JVM that you are using; the examples here are for the Sun JVM).

    JVM option passed to Resin | Meaning
    -Xms                       | initial java heap size
    -Xmx                       | maximum java heap size
    -Xmn                       | the size of the heap for the young generation

    Resin startup with heap memory options
    unix> bin/httpd.sh -Xmn100M -Xms500M -Xmx500M
    win> bin/httpd.exe -Xmn100M -Xms500M -Xmx500M
    install win service> bin/httpd.exe -Xmn100M -Xms500M -Xmx500M -install

    It is good practice with server-side Java applications like Resin to set the minimum -Xms and maximum -Xmx heap sizes to the same value.

    For efficient garbage collection, the -Xmn value should be lower than the -Xmx value.

    Heap size does not determine the amount of memory your process uses

    If you monitor your java process with an OS tool like top or taskmanager, you may see the amount of memory you use exceed the amount you have specified for -Xmx. -Xmx limits only the java heap size; java also allocates memory for other things, including a stack for each thread. It is not unusual for the total memory consumption of the VM to exceed the value of -Xmx.
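The split between heap and other memory can be observed from inside the VM with the standard java.lang.management beans. The sketch below is illustrative and the class name is hypothetical. Note that the non-heap figure reported here covers areas such as class metadata and compiled code; as far as I know, native thread stacks are not included, so an OS tool can still show a larger total.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

final class MemoryReportSketch {
    private static final long MB = 1024L * 1024L;

    // Builds a one-line summary of heap and non-heap usage for this JVM.
    static String report() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();
        // getMax() returns -1 when no maximum is defined.
        String heapMax = heap.getMax() < 0 ? "undefined" : (heap.getMax() / MB) + "M";
        return "heap used=" + (heap.getUsed() / MB) + "M"
            + " heap max=" + heapMax
            + " non-heap used=" + (nonHeap.getUsed() / MB) + "M";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```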

    Garbage collection

    (thanks to Rob Lockstone for his comments)

    There are essentially two GC threads running. One is a very lightweight thread which does “little” collections primarily on the Eden (a.k.a. Young) generation of the heap. The other is the Full GC thread which traverses the entire heap when there is not enough memory left to allocate space for objects which get promoted from the Eden to the older generation(s).

    If there is a memory leak or inadequate heap allocated, eventually the older generation will start to run out of room causing the Full GC thread to run (nearly) continuously. Since this process “stops the world”, Resin won’t be able to respond to requests and they’ll start to back up.

    The amount allocated for the Eden generation is the value specified with -Xmn. The amount allocated for the older generation is the value of -Xmx minus the -Xmn. Generally, you don’t want the Eden to be too big or it will take too long for the GC to look through it for space that can be reclaimed.
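As a worked example of that arithmetic, the startup line shown earlier (-Xmn100M -Xmx500M) leaves 400M for the older generation. The helper below is a hypothetical sketch that parses only the k, m, and g suffixes; it is not how the JVM itself reads its options.

```java
final class GenerationSizeSketch {
    // Parses sizes such as "100M" or "2048k" into bytes.
    static long parseSize(String text) {
        char last = Character.toLowerCase(text.charAt(text.length() - 1));
        if (Character.isDigit(last)) {
            return Long.parseLong(text); // plain byte count, no suffix
        }
        long number = Long.parseLong(text.substring(0, text.length() - 1));
        switch (last) {
            case 'k': return number * 1024L;
            case 'm': return number * 1024L * 1024L;
            case 'g': return number * 1024L * 1024L * 1024L;
            default: throw new IllegalArgumentException("unknown unit: " + last);
        }
    }

    // Older generation = maximum heap (-Xmx) minus young generation (-Xmn).
    static long oldGenerationBytes(String xmx, String xmn) {
        return parseSize(xmx) - parseSize(xmn);
    }

    public static void main(String[] args) {
        long bytes = oldGenerationBytes("500M", "100M");
        System.out.println((bytes / (1024L * 1024L)) + "M"); // 400M
    }
}
```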

    Stack size

    Each thread in the VM gets a stack. The stack size limits the number of threads that you can have: with too big a stack size you will run out of memory, because each thread is allocated more memory than it needs.

    The Resin startup scripts (httpd.exe on Windows, wrapper.pl on Unix) will set the stack size to 2048k, unless it is specified explicitly. 2048k is an appropriate value for most situations.

    JVM option passed to Resin | Meaning
    -Xss                       | the stack size for each thread

    -Xss determines the size of the stack, for example -Xss1024k. If the stack space is too small, eventually you will see an exception of class java.lang.StackOverflowError.
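The relationship between stack size and recursion depth can be probed without restarting the VM, because the Thread constructor accepts a requested stack size. The sketch below is a rough experiment with hypothetical names; the stack size argument is only a hint that some platforms round or ignore, and the depths reached depend on the JVM, so no particular numbers should be expected.

```java
final class StackDepthSketch {
    // Recurses until the stack is exhausted, counting frames as it goes.
    private static void recurse(int[] frames) {
        frames[0]++;
        recurse(frames);
    }

    // Runs the probe on a thread with the requested stack size (in bytes)
    // and returns how many frames fit before StackOverflowError was thrown.
    static int maxDepth(long stackBytes) {
        final int[] frames = new int[1];
        Runnable probe = () -> {
            try {
                recurse(frames);
            } catch (StackOverflowError expected) {
                // The stack is full; frames[0] holds the depth reached.
            }
        };
        Thread thread = new Thread(null, probe, "stack-probe", stackBytes);
        thread.start();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return frames[0];
    }

    public static void main(String[] args) {
        System.out.println("256k stack: " + maxDepth(256L * 1024L) + " frames");
        System.out.println("2048k stack: " + maxDepth(2048L * 1024L) + " frames");
    }
}
```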

    Some people have reported that it is necessary to change stack size settings at the OS level for Linux. A call to ulimit may be necessary, and is usually done with a command in /etc/profile:

    Limit thread stack size on Linux
    ulimit -s 2048
    Monitoring the JVM

    JDK 5 includes a number of tools that are useful for monitoring the JVM. Documentation for these tools is available from the Sun website. For JDKs prior to 5, Sun provides the jvmstat tools.

    The most useful tool is jconsole. Details on using jconsole are provided in the Administration section of the Resin documentation.

    jconsole
    win> ./httpd.exe -Dcom.sun.management.jmxremote
    unix> bin/httpd.sh -Dcom.sun.management.jmxremote
    
     ... in another shell window ... 
    
    win> jconsole.exe
    unix> jconsole
    
    Choose Resin's JVM from the "Local" list.

    jps and jstack are also useful, providing a quick command line method for obtaining stack traces of all current threads. Details on obtaining and interpreting stack traces are in the Troubleshooting section of the Resin documentation.
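When attaching an external tool is inconvenient, a similar snapshot can be taken from inside the application with Thread.getAllStackTraces(). The following is a minimal sketch with a hypothetical class name; its output format only loosely imitates a thread dump and is not the format jstack prints.

```java
import java.util.Map;

final class ThreadDumpSketch {
    // Collects the name, state, and stack frames of every live thread.
    static String dump() {
        StringBuilder out = new StringBuilder();
        Map<Thread, StackTraceElement[]> traces = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            Thread thread = entry.getKey();
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```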

    Source: http://www.caucho.com/

    Server Caching

    Server caching can speed dynamic pages to near-static speeds. Many pages created by database queries only change every 15 minutes or so, e.g. CNN or Slashdot. Resin can cache the results and serve them like static pages. Resin’s caching will work for any servlet, including JSP and XTP pages. It depends only on the headers the servlet returns in the response.

    By default, pages are not cached. To cache, a page must set a HTTP caching header.

    Resin’s caching operates like a proxy cache. It’s controlled by the same HTTP headers as any proxy cache. Every user shares the same cached page.

    1. Cache-Control: max-age
    2. Debugging caching
    3. Expires
    4. If-Modified
    5. Servlets
    6. Sessions and Cookies
    7. Included Pages
    8. Vary
    9. Caching Anonymous Users
    10. Experimental Anonymous Caching
    11. cache-mapping
    Cache-Control: max-age

    Setting the max-age directive of the Cache-Control header will cache the results for a specified time. For heavily loaded pages, even short expiration times can significantly improve performance.

    Note, pages using sessions should not be cached, although more sophisticated headers like “Cache-Control: private” can specify caching only for the session’s browser.

    The following example sets expiration for 15 seconds. So the counter should update slowly.

    max-age
    <%@ page session="false" %>
    <%! int counter; %>
    <%
    response.addHeader("Cache-Control", "max-age=15");
    %>
    Count: <%= counter++ %>

    max-age is useful for database generated pages which are continuously, but slowly updated. To cache based on something with a known modified date, like a file, you can use If-Modified.
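The freshness rule behind max-age can be written down in a few lines. This is a simplified model of what any HTTP cache does with the directive, not Resin's actual implementation; the class and method names are hypothetical, and the parser assumes lowercase directive names.

```java
final class MaxAgeSketch {
    private static final String MAX_AGE = "max-age=";

    // Extracts the max-age value in seconds, or -1 if the directive is absent.
    static long maxAgeSeconds(String cacheControl) {
        for (String directive : cacheControl.split(",")) {
            String trimmed = directive.trim();
            if (trimmed.startsWith(MAX_AGE)) {
                return Long.parseLong(trimmed.substring(MAX_AGE.length()));
            }
        }
        return -1;
    }

    // A stored response is fresh while its age is below max-age.
    static boolean isFresh(String cacheControl, long storedAtMillis, long nowMillis) {
        long maxAge = maxAgeSeconds(cacheControl);
        return maxAge >= 0 && nowMillis - storedAtMillis < maxAge * 1000L;
    }

    public static void main(String[] args) {
        System.out.println(isFresh("max-age=15", 0L, 14_000L)); // true
        System.out.println(isFresh("max-age=15", 0L, 16_000L)); // false
    }
}
```

With the 15-second example above, the counter page is regenerated at most once every 15 seconds, no matter how many requests arrive in between.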

    Debugging caching

    When designing and testing your cached page, it’s important to see how Resin is caching the page. To turn on logging for caching, you’ll add the following to your resin.conf:

    adding caching log
    <log name="com.caucho.server.cache"
         path="log/cache.log"
         level="fine"/>

    The output will look something like the following:

    [10:18:11.369] caching: /images/caucho-white.jpg etag="AAAAPbkEyoA" length=6190
    [10:18:11.377] caching: /images/logo.gif etag="AAAAOQ9zLeQ" length=571
    [10:18:11.393] caching: /css/default.css etag="AAAANzMooDY" length=1665
    [10:18:11.524] caching: /images/pixel.gif etag="AAAANpcE4pY" length=61
    
    ...
    
    [2003/09/12 10:18:49.303] using cache: /css/default.css
    [2003/09/12 10:18:49.346] using cache: /images/pixel.gif
    [2003/09/12 10:18:49.348] using cache: /images/caucho-white.jpg
    [2003/09/12 10:18:49.362] using cache: /images/logo.gif
    Expires

    An application can also set the Expires header to enable caching, when the expiration date is a specific time instead of an interval. For heavily loaded pages, even setting short expires times can significantly improve performance. Sessions should be disabled for caching.

    The following example sets expiration for 15 seconds. So the counter should update slowly.

    Expires
    <%@ page session="false" %>
    <%! int counter; %>
    <%
    long now = System.currentTimeMillis();
    response.setDateHeader("Expires", now + 15000);
    %>
    Count: <%= counter++ %>

    Expires is useful for database generated pages which are continuously, but slowly updated. To cache based on something with a known modified date, like a file, you can use If-Modified.

    If-Modified

    The If-Modified headers let you cache based on an underlying change date. For example, the page may only change when an underlying source page changes. Resin lets you easily use If-Modified by overriding methods in HttpServlet or in a JSP page.

    The following page only changes when the underlying ‘test.xml’ page changes.

    <%@ page session="false" import="java.io.File" %>
    <%!
    int counter;
    
    public long getLastModified(HttpServletRequest req)
    {
      String path = req.getRealPath("test.xml");
      return new File(path).lastModified();
    }
    %>
    Count: <%= counter++ %>

    If-Modified pages are useful in combination with the cache-mapping configuration.
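The decision an If-Modified cache makes comes down to comparing two timestamps. The sketch below models that comparison in plain Java, outside the servlet API; it is an assumption-laden illustration rather than Resin's code, and the names are hypothetical. A value of -1 stands for a request without an If-Modified-Since header.

```java
final class IfModifiedSketch {
    static final int OK = 200;
    static final int NOT_MODIFIED = 304;

    // Chooses the response status from the page's last-modified time and the
    // If-Modified-Since value sent by the client (both in epoch milliseconds).
    static int status(long lastModifiedMillis, long ifModifiedSinceMillis) {
        // HTTP dates have one-second resolution, so drop the milliseconds first.
        long lastModifiedSeconds = lastModifiedMillis / 1000L * 1000L;
        if (ifModifiedSinceMillis >= 0 && ifModifiedSinceMillis >= lastModifiedSeconds) {
            return NOT_MODIFIED; // the client's copy is still current
        }
        return OK; // regenerate and send the full page
    }

    public static void main(String[] args) {
        System.out.println(status(10_500L, -1L));     // 200, no validator sent
        System.out.println(status(10_500L, 10_000L)); // 304, unchanged since then
        System.out.println(status(20_000L, 10_000L)); // 200, changed afterwards
    }
}
```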

    Servlets

    Caching servlets is exactly like caching JSP pages (or XTP or static files). Resin’s caching mechanism works like a proxy cache: it doesn’t care how the page is generated; as long as the proper caching headers are set, the page will be cached.

    package test;
    
    import javax.servlet.*;
    import javax.servlet.http.*;
    import java.io.*;
    
    public class TestServlet extends HttpServlet {
      int counter;
    
      public long getLastModified(HttpServletRequest req)
      {
        String path = req.getRealPath("test.xml");
        return new File(path).lastModified();
      }
    
      public void doGet(HttpServletRequest req,
                        HttpServletResponse res)
        throws IOException, ServletException
      {
        PrintWriter out = res.getWriter();
    
        out.println("Count: " + counter++);
      }
    }
    Sessions and Cookies

    Because Resin follows the HTTP caching spec, setting caching on a page which relies on sessions will make the page cacheable. You can test this by hitting the following page with separate browsers.

    caching session information
    <% response.addHeader("Cache-Control", "max-age=3600"); %>
    
    session id <%= session.getId() %>

    Normally, this behavior is not what you want. Instead, you may want the browser to cache the page, but not let other browsers see the same page. To do that, you’ll set the “Cache-Control: private” header. You’ll need to use addHeader, not setHeader, so the browser will get both “Cache-Control” directives.

    private(browser) caching
    <%
      response.addHeader("Cache-Control", "max-age=3600");
      response.addHeader("Cache-Control", "private");
    %>
    
    session id <%= session.getId() %>
    Included Pages

    Resin can cache subpages even when the top page can’t be cached. Sites allowing user personalization will often design pages with jsp:include subpages. Some subpages are user-specific and can’t be cached. Others are common to everybody and can be cached.

    Resin treats subpages as independent requests, so they can be cached independent of the top-level page. Try the following, use the first expires counter example as the included page. Create a top-level page that looks like:

    <% if (! session.isNew()) { %>
    Welcome back!
    <% } %>
    <jsp:include page="expires.jsp"/>
    Vary

    In some cases, you’ll want to have separate cached pages for the same URL depending on the capabilities of the browser. Using gzip compression is the most important example. Browsers which can understand gzip-compressed files receive the compressed page while simple browsers will see the uncompressed page. Using the “Vary” header, Resin can cache different versions of that page.

    caching based on gzip
    <%
      response.addHeader("Cache-Control", "max-age=3600");
      response.addHeader("Vary", "Accept-Encoding");
    %>
    
    Accept-Encoding: <%= request.getHeader("Accept-Encoding") %>
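One way to picture what Vary does is to think of the cache key as the URL plus the values of the request headers that the response names. The sketch below is a conceptual model only; how Resin actually stores its variants is not described here, and the class name, key format, and case-sensitive header lookup are all assumptions made for the illustration.

```java
import java.util.Collections;
import java.util.Map;

final class VaryKeySketch {
    // Builds a cache key from the URL and the request headers listed in Vary.
    static String cacheKey(String url, String vary, Map<String, String> requestHeaders) {
        StringBuilder key = new StringBuilder(url);
        for (String name : vary.split(",")) {
            String header = name.trim();
            if (header.isEmpty()) {
                continue;
            }
            String value = requestHeaders.get(header);
            key.append('|').append(header).append('=')
               .append(value == null ? "" : value);
        }
        return key.toString();
    }

    public static void main(String[] args) {
        Map<String, String> gzipBrowser =
            Collections.singletonMap("Accept-Encoding", "gzip");
        Map<String, String> simpleBrowser = Collections.emptyMap();

        // Two different keys, so each kind of browser gets its own cached copy.
        System.out.println(cacheKey("/page.jsp", "Accept-Encoding", gzipBrowser));
        System.out.println(cacheKey("/page.jsp", "Accept-Encoding", simpleBrowser));
    }
}
```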
    Caching Anonymous Users

    In many cases, logged in users get specialized pages, but anonymous users all see the same page. In this case, you can still take advantage of Resin’s caching, but you’ll need to do a little work in your design.

    First, you’ll need to create an include() subpage that contains the common page. The top page can’t be cached because it depends on whether a user is logged in or not.

    You must use include() because forward() is cached just like the top page. The top page isn’t cacheable because of the user login, so the forwarded page isn’t cacheable either.

    Here’s what a sample subpage might look like:

    <%@ page session="false" %>
    <%! int counter; %>
    <%
    long now = System.currentTimeMillis();
    response.setDateHeader("Expires", now + 15000);
    
    String user = request.getParameter("user");
    %>
    User: <%= user %> <%= counter++ %>

    The top page is slightly trickier because it needs to pass the user to the subpage. You need to pass a unique id. If you pass a boolean logged-in parameter, all logged in users will see the same page.

    <%@ page session="true" %>
    <%
    String user = getSomeSortOfUniqueUserId();
    if (user == null)
      user = "Anonymous";
    %>
    
    ...

    Of course, the top-level page could also be a servlet:

    ...
    
    String user = getSomeSortOfUniqueUserId(request);
    if (user == null)
      user = "Anonymous";
    
    RequestDispatcher disp;
    disp = request.getRequestDispatcher("/cachedpage.jsp?user=" + user);
    
    disp.include(request, response);
    Experimental Anonymous Caching

    Resin includes an anonymous user caching feature. If a user is not logged in, she will get a cached page. If she’s logged in, she’ll get her own page. This feature will not work if anonymous users are assigned cookies for tracking purposes.

    To make anonymous caching work, you must set the Cache-Control: x-anonymous header. If you omit the x-anonymous header, Resin will use the Expires to cache the same page for every user.

    <%@ page session="false" %>
    <%! int counter; %>
    <%
    response.addHeader("Cache-Control", "max-age=15");
    response.addHeader("Cache-Control", "x-anonymous");
    
    String user = request.getParameter("user");
    %>
    User: <%= user %> <%= counter++ %>

    The top page must still set the Expires or If-Modified header, but Resin will take care of deciding if the page is cacheable or not. If the request has any cookies, Resin will not cache it or use the cached page. If it has no cookies, Resin will use the cached page.

    When using x-anonymous, user tracking cookies will make the page uncacheable even if the page is the same for all users. Resin chooses to cache or not based on the existence of any cookies in the request, whether they’re used or not.

    cache-mapping

    cache-mapping assigns a max-age and Expires to an If-Modified cacheable page. It does not affect max-age or Expires cached pages. The FileServlet takes advantage of cache-mapping because it’s an If-Modified servlet.

    Often, you want a long Expires time for a page to a browser. For example, any gif will not change for 24 hours. That keeps browsers from asking for the same gif every five seconds; that’s especially important for tiny formatting gifs. However, as soon as that page or gif changes, you want the change immediately available to any new browser or to a browser using reload.

    Here’s how you would set the Expires to 24 hours for a gif, based on the default FileServlet.

    <web-app id='/'>
      <cache-mapping url-pattern='*.gif'
                     expires='24h'/>
    </web-app>

    The cache-mapping automatically generates the Expires header. It only works for cacheable pages setting If-Modified or ETag. It will not affect pages explicitly setting Expires or non-cacheable pages. So it’s safe to create a cache-mapping for *.jsp even if only some are cacheable.

    Source: http://www.caucho.com/