Content and Design Tips

Make web pages for users, not for search engines

Create a useful, information-rich website. Write pages that clearly and accurately describe your content. Don't load pages with irrelevant words. Think about the words and phrases customers might use to search for information, and make sure your site actually includes those words.

Focus on text

Focus on the text on your site. Make sure that the TITLE and ALT tags you use on each page are descriptive and accurate. Since the Google Crawler doesn't recognize text contained in images, avoid using text in graphics. Instead, use descriptive text in the ALT tag of your graphics.
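
For example, a page about renewing a driver's license might use markup like the following (the page title, image file name, and wording here are hypothetical):

    <head>
      <title>Renewing Your Driver's License - Registry of Motor Vehicles</title>
    </head>
    <body>
      <img src="renewal-steps.gif"
           alt="Diagram of the four steps to renew a driver's license online">
    </body>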

When linking to non-HTML documents (such as PDF or Word files), describe the linked document's content clearly in the text of the page you are linking from. This saves users from downloading an entire document only to discover that it does not contain the information they are looking for.
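
For instance, rather than linking to a PDF with the bare text "click here," describe what the document contains (the file name below is hypothetical):

    <!-- Weak: tells users and crawlers nothing about the document -->
    <a href="budget.pdf">Click here</a>

    <!-- Better: describes the linked document's content -->
    <a href="budget.pdf">Fiscal Year 2012 budget summary (PDF, 24 pages)</a>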

Make your site easy to navigate

Make a site with a clear hierarchy of links to other content. Every page should be reachable from at least one link located elsewhere in the site. Offer a site map to your users with links that point to the important parts of your site. Keep the links on a given page to a reasonable number (fewer than 100).
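
A site map can be as simple as a single page of plain links to the major sections of your site; for example (the section names are hypothetical):

    <h1>Site Map</h1>
    <ul>
      <li><a href="/services/">Online Services</a></li>
      <li><a href="/forms/">Forms and Applications</a></li>
      <li><a href="/news/">News and Announcements</a></li>
      <li><a href="/contact/">Contact Us</a></li>
    </ul>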

Ensure that your site is linked

Ensure that your site is linked from all relevant content and other sites within state government. Interlinking between sites and within sites gives the Google Crawler additional paths for finding content and improves the quality of search results.

Technical Tips

Make sure that the Google crawler can read your content

Validate all HTML content to ensure that the HTML is well-formed. Download and use a text browser such as Lynx to examine your site because most search engine spiders see your site much as Lynx would. If extra features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine crawlers may have trouble crawling your site.
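
For example, Lynx can dump a page as plain text from the command line, showing roughly what a crawler sees (the URL is a placeholder):

    lynx -dump http://www.example.gov/services/index.html

If important content or links are missing from the output, search engine crawlers may be missing them as well.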

Allow search bots to crawl your sites without session IDs or arguments that track their path through the site. These techniques are useful for tracking individual user behavior, but the access pattern of bots is entirely different. Using these techniques may result in multiple copies of the same document being indexed for your site, as crawl robots will see each unique URL (including session ID) as a unique document.
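
For illustration, the two URLs below (hypothetical) point to the same document, but because the session ID changes on every visit, a crawler will treat them as two different pages:

    http://www.example.gov/page.html;jsessionid=A1B2C3D4E5F6
    http://www.example.gov/page.html;jsessionid=G7H8I9J0K1L2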

Ensure that your site's internal link structure provides a link path to all of your pages. Every page should be reachable from at least one link located elsewhere in the site. Google's search engine follows links from one page to the next, so pages that are not linked to by others may be missed. Additionally, you should contact web.work@state.ma.us to ensure that your site's home page is accessible to the search engine.

Technical Information for sites not hosted on Mass.Gov

Use robots standards to control search engine interaction with your content

Make use of the "robots.txt" file on your web server. This file tells search engine crawlers which files and directories can or cannot be crawled, including by file type. If the search engine receives an error when requesting this file, no content on that server will be crawled. The robots.txt file is checked on a regular basis, but changes may not take effect immediately. Each protocol and port combination (including HTTP and HTTPS) requires its own robots.txt file.
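
For example, a minimal robots.txt that blocks all crawlers from two directories while allowing everything else might look like this (the directory names are hypothetical):

    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /internal/

The file must be named robots.txt and must sit at the root of the server (for example, http://www.example.gov/robots.txt).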

Use robots meta tags to control whether individual documents are indexed, whether the links on a document should be crawled, and whether the document should be cached. The "NOARCHIVE" value for robots meta tags is supported by the Google search engine to block cached content, even though it is not mentioned in the robots standard.
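
For example, the following tag, placed in a document's HEAD section, keeps that one document out of the index, stops its links from being followed, and blocks the cached copy; use only the values you actually need:

    <meta name="robots" content="noindex, nofollow, noarchive">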

For information on how robots.txt files and ROBOTS meta tags work, review the Robots Exclusion standard at www.robotstxt.org.

If the search engine is generating too much traffic on your site during peak hours, contact web.work@state.ma.us to have the crawl schedule and rate adjusted.

Let the search engine know how fresh your content is

Make sure your web server supports the If-Modified-Since HTTP header. This feature allows your web server to tell the Google Search Appliance whether your content has changed since it last crawled your site. Supporting this feature saves you bandwidth and overhead.
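
A typical exchange looks like this: the crawler sends the date it last fetched the page, and if the page has not changed, the server can answer with a short 304 response instead of the full document (the URL and dates are placeholders):

    GET /services/index.html HTTP/1.1
    Host: www.example.gov
    If-Modified-Since: Tue, 10 Jan 2012 08:00:00 GMT

    HTTP/1.1 304 Not Modified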

Understand why some documents may be missing from the index

Each time that the Google Search Appliance updates its database of web pages, the documents in the index can change. Here are a few examples of reasons why pages may not appear in the index:

  • Your content pages may have been intentionally blocked by a robots.txt file or ROBOTS meta tags.

  • Your website was inaccessible when the crawl robot attempted to access it, due to a network or server outage. If this happens, the Google Search Appliance retries multiple times, but if the site still cannot be crawled, it will not be included in the index.

  • The Google crawl robot cannot find a path of links to your site from the starting points it was given.

  • Your content pages may not be considered relevant to the query you entered. Ensure that the query terms exist on your target page.

  • Your content pages contain invalid HTML code.

If you still have questions, please contact web.work@state.ma.us.

Avoid using frames

The Google search engine supports frames only to a limited extent. Frames tend to cause problems with search engines, bookmarks, email links, and so on, because frames don't fit the conceptual model of the web, where every document corresponds to a single URL.

Searches that return framed pages typically match only the body HTML page and present it back without the original menu or header frames. Google recommends that you use tables or dynamically generate content into a single page (using ASP, JSP, PHP, or server-side includes) instead of FRAME tags. This maintains the content owner's originally intended look and feel and allows most search engines to properly index your content, as in the sketch below.
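
As a sketch, a server with server-side includes enabled (one common alternative; ASP, JSP, and PHP offer equivalents) can assemble the header, menu, and body into one page with a single URL (the include paths are hypothetical):

    <body>
      <!--#include virtual="/includes/header.html" -->
      <!--#include virtual="/includes/menu.html" -->
      <p>Page content, indexed under this page's own URL.</p>
    </body>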

Avoid placing content and links in script code

Most search engines do not read any information found in SCRIPT tags within an HTML document. This means that content within script code will not be indexed, and links contained within script code will not be followed during crawling. When using a scripting language, make sure that your content and links appear outside SCRIPT tags. Also investigate alternative HTML techniques for dynamic pages, such as HTML layers.
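
For example, a link generated entirely by script code is invisible to most crawlers, while an ordinary anchor is not (the URL is a placeholder):

    <!-- Not crawlable: the link exists only inside script code -->
    <script type="text/javascript">
      document.write('<a href="/reports/">Annual reports</a>');
    </script>

    <!-- Crawlable: the same link as ordinary HTML -->
    <a href="/reports/">Annual reports</a>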


Information provided by the Information Technology Division, Mass.Gov Office. Last reviewed: January 12, 2012.