seo |

SEO Finds In Your Server Log

I am a huge Portland Trail Blazers fan, and in the early 2000s, my favorite player was Rasheed Wallace. He was a lightning-rod of a player, and fans either loved or hated him. He led the league in technical fouls nearly every year he was a Blazer; mostly because he never thought he committed any sort of foul. Many of those said technicals came when the opposing player missed a free-throw attempt and ‘Sheed’ passionately screamed his mantra: “BALL DON’T LIE.”

‘Sheed’ asserts that a basketball has metaphysical powers that acts as a system of checks and balances for the integrity of the game. While this is debatable (ok, probably not true), there is a parallel to technical SEO: marketers and developers often commit SEO fouls when architecting a site or creating content, but implicitly deny that anything is wrong.

As SEOs, we use all sorts of tools to glean insight into technical issues that may be hurting us: web analytics, crawl diagnostics, and Google and Bing Webmaster tools. All of these tools are useful, but there are undoubtedly holes in the data. There is only one true record of how search engines, such as Googlebot, process your website. These are web server logs. As I am sure Rasheed Wallace would agree, logs are a powerful source of oft-underutilized data that helps keep the integrity of your site’s crawl by search engines in check.

A server log is a detailed record of every action performed by a particular server. In the case of a web server, you can get a lot of useful information. In fact, back in the day before free analytics (like Google Analytics) existed, it was common to just parse and review your web logs with software like AWStats.

I initially planned on writing a single post on this subject, but as I got going I realized that there was a lot of ground to cover. Instead, I will break it into 2 parts, each highlighting different problems that can be found in your web server logs:

This post: how to retrieve and parse a log file, and identifying problems based on your server’s response code (404, 302, 500, etc.).
The next post: identifying duplicate content, encouraging efficient crawling, reviewing trends, and looking for patterns and a few bonus non-SEO related tips.

Step #1: Fetching a log file

Web server logs come in many different formats, and the retrieval method depends on the type of server your site runs on. Apache and Microsoft IIS are two of the most common. The examples in this post will based on an Apache log file from SEOmoz.

If you work in a company with a Sys Admin, be really nice and ask him/her for a log file with a day’s worth of data and the fields that are listed below. I’d recommend keeping the size of the file below 1 gig as the log file parser you’re using might choke up. If you have to generate the file on your own, the method for doing so depends on how your site is hosted. Some hosting services store them in your home directory in a folder called /logs and will drop a compressed log file in that folder on a daily basis. You’ll want to make sure to it includes the following columns:

Host: you will use this to filter out internal traffic. In SEOmoz’s case, RogerBot spends a lot of time crawling the site and needed to be removed for our analysis.
Date: if you are analyzing multiple days this will allow you to analyze search engine crawl rate trends by day.
Page/File: this will tell you which directory and file is being crawled and can help pinpoint endemic issues in certain sections or with types of content.
Response code: knowing the response of the server — the page loaded fine (200), was not found (404), the server was down (503) — provides invaluable insight into inefficiencies that the crawlers may be running into.
Referrers: while this isn’t necessarily useful for analyzing search bots, it is very valuable for other traffic analysis.
User Agent: this field will tell you which search engine made the request and without this field, a crawl analysis cannot be performed.

Apache log files by default are returned without User Agent or Referrer — this is known as a “common log file.” You will need to request a “combine log file.” Make your Sys Admin’s job a little easier (and maybe even impress) and request the following format:

LogFormat “%h %l %u %t \”%r\” %>s %b \”%{Referer}i\” \”%{User-agent}i\””

For Apache 1.3 you just need “combined CustomLog log/acces_log combined”

For those who need to manually pull the logs, you will need to create a directive in the httpd.conf file with one of the above. A lot more detail here on this subject.

Step #2: Parsing a log file

You probably now have a compressed log file like ‘mylogfile.gz’ and it’s time to start digging in. There are myriad software products, free and paid, to analyze and/or parse log files. My main criteria for picking one includes: the ability to view the raw data, the ability to filter prior to parsing, and the ability to export to CSV. I landed on Web Log Explorer (http://www.exacttrend.com/WebLogExplorer/) and it has worked for me for several years. I will use it along with Excel for this demonstration. I’ve used AWstats for basic analysis, but found that it does not offer the level of control and flexibility that I need. I’m sure there are several more out there that will get the job done.

The first step is to import your file into your parsing software. Most web log parsers will accept various formats and have a simple wizard to guide you through the import. With the first pass of the analysis, I like to see all the data and do not apply any filters. At this point, you can do one of two things: prep the data in the parse and export for analysis in Excel, or do the majority of the analysis in the parser itself. I like doing the analysis in Excel in order to create a model for trending (I’ll get into this in the follow-up post). If you want to do a quick analysis of your logs, using the parser software is a good option.

Import Wizard: make sure to include the parameters in the URL string. As I will demonstrate in later posts this will help us find problematic crawl paths and potential sources for duplicate content.

You can choose to filter the data using some basic regex before it is parsed. For example, if you only wanted to analyze traffic to a particular section of your site you could do something like:

Once you have your data loaded into the log parser, export all spider requests and include all response codes:

Once you have exported the file to CSV and opened in Excel, here are some steps and examples to get the data ready for pivoting into analysis and action:

1. Page/File: in our analysis we will try to expose directories that could be problematic so we want to isolate the directory from the file. The formula I use to do this in Excel looks something like this.

Formula: <would like to put this is a textbox of some sort>

=IF(ISNUMBER(SEARCH(“/”,C29,2)),MID(C29,(SEARCH(“/”,C29)),(SEARCH(“/”,C29,(SEARCH(“/”,C29)+1)))-(SEARCH(“/”,C29))),”no directory”)

2. User Agent: in order to limit our analysis to the search engines we care about, we need to search this field for specific bots. In this example, I’m including Googlebot, Googlebot-Images, BingBot, Yahoo, Yandex and Baidu.

Formula (yeah, it’s U-G-L-Y)

=IF(ISNUMBER(SEARCH(“googlebot-image”,H29)),”GoogleBot-Image”, IF(ISNUMBER(SEARCH(“googlebot”,H29)),”GoogleBot”,IF(ISNUMBER(SEARCH(“bing”,H29)),”BingBot”,IF(ISNUMBER(SEARCH(“Yahoo”,H29)),”Yahoo”, IF(ISNUMBER(SEARCH(“yandex”,H29)),”yandex”,IF(ISNUMBER(SEARCH(“baidu”,H29)),”Baidu”, “other”))))))

Your log file is now ready for some analysis and should look something like this:

Let’s take a breather, shall we?

Step # 3: Uncover server and response code errors

The quickest way to suss out issues that search engines are having with the crawl of your site is to look at the server response codes that are being served. Too many 404s (page not found) can mean that precious crawl resources are being wasted. Massive 302 redirects can point to link equity dead-ends in your site architecture. While Google Webmaster Tools provides some information on such errors, they do not provide a complete picture: LOGS DON’T LIE.

The first step to the analysis is to generate a pivot table from your log data. Our goal here is to isolate the spiders along with the response codes that are being served. Select all of your data and go to ‘Data>Pivot Table.’

On the most basic level, let’s see who is crawling SEOmoz on this particular day:

There are no definitive conclusions that we can make from this data, but there are a few things that should be noted for further analysis. First, BingBot is crawling the site at about an 80% more clip. Why? Second, ‘other’ bots account for nearly half of the crawls. Did we miss something in our search of the User Agent field? As for the latter, we can see from a quick glance that most of which is accounting for ‘other’ is RogerBot — we’ll exclude this.

Next, let’s have a look at server codes for the engines that we care most about.

I’ve highlighted the areas that we will want to take a closer look. Overall, the ratio of good to bad looks healthy, but since we live by the mantra that “every little bit helps” let’s try to figure out what’s going on.

1. Why is Bing crawling the site at 2x that of Google? We should investigate to see if Bing is crawling inefficiently and if there is anything we can do to help them along or if Google is not crawling as deep as Bing and if there is anything we can do to encourage a deeper crawl.

By isolating the pages that were successfully served (200s) to BingBot the potential culprit is immediately apparent. Nearly 60,000 of 100,000 pages that BingBot crawled successfully were user login redirects from a comment link.

The problem: SEOmoz is architected in such a way that if a comment link is requested and JavaScript is not enabled it will serve a redirect (being served as a 200 by the server) to an error page. With nearly 60% of Bing’s crawl being wasted on such dead-ends, it is important that SEOmoz block the engines from crawling.

The solution: add rel=’nofollow’ to all comment and reply to comment links. Typically, the ideal method for telling and engine not to crawl something is a directive in the robots.txt file. Unfortunately, that won’t work in this scenario because the URL is being served via the JavaScript after the click.

GoogleBot is dealing with the comment links better than Bing and avoiding them altogether. However, Google is crawling a handful of links sucessfully that are login redirects. Take a quick look at the robots.txt and you will see that this directory should probably be blocked.

2. The number of 302s being served to Google and Bing is acceptable, but it doesn’t hurt to review in case there are better ways for dealing with some of edge cases. For the most part SEOmoz is using 302s for defunct blog category architecture that redirects the user to the main blog page. They are also being used for private message pages /message, and a robots.txt directive should exclude these pages from being crawled at all.

3. Some of the most valuable data that you can get from your server logs are links that are being crawled that resolve in a 404. SEOmoz has done a good job managing these errors and does not have an alarming level of 404s. A quick way to identify potential problems is to isolate 404s by directory. This can be done by running a pivot table with “Directory” as your row label and count of “Directory” in your value field. You’ll get something like:

The problem: the main issue that’s popping here is 90% of the 404s are in one directory, /comments. Given the issues with BingBot and the JavaScript driven redirect mentioned above this doesn’t really come as a surprise.

The solution: the good news is that since we are already using rel=’nofollow’ on the comment links these 404s should also be taken care of.

Conclusion

Google and Bing Webmaster tools provide you information on crawl errors, but in many cases they limit the data. As SEOs we should use every source of data that is available and after all, there is only one source of data that you can truly rely on: your own.

LOGS DON’T LIE!

And for your viewing pleasure, here’s a bonus clip for reading the whole post.

Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don’t have time to hunt down but want to read!

Continue reading →

Back to the Future: Forecasting Your Organic Traffic

Posted by Dan Peskin

This post was originally in YouMoz, and was promoted to the main blog because it provides great value and interest to our community. The author’s views are entirely his or her own and may not reflect the views of SEOmoz, Inc.

Great Scott! I am finally back again for another spectacularly lengthy post, rich with wonderful titles, and this time – statistical goodness. It just so happens, that in my past short-lived career, I was a Forecast Analyst (not this kind). So today class, we will be learning about the importance of forecasting organic traffic and how you can get started. Let’s begin our journey.

I just put this here because it looks really cool.

Forecasting is Your Density. I Mean, Your Destiny

Why should I forecast? Besides the obvious answer – it’s f-ing cool to predict the future, there are a number of benefits for both you and your company.

Forecasting adds value in both an agency and in-house setting. It provides a more accurate way to set goals and plan for the future, which can be applied to client projects, internal projects, or overall team/dept. strategy.

Forecasting creates accountability for your team. It allows you to continually set goals based on projections and monitor performance through forecast accuracy (Keep in mind that exceeding goals is not necessarily a good thing, which is why forecast accuracy is important. We will discuss this more later).

Forecasting teaches you about inefficiencies in your team, process, and strategy. The more you segment your forecast, the deeper you can dive into finding the root of the inaccuracies in your projections. And the more granular you get, the more accurate your forecast, so you will see that segmentation is a function of accuracy (assuming you continually work to improve it).

Forecasting is money. This is the most important concept of forecasting, and probably the point in where you decided that you will read the rest of this article.

The fact that you can improve inefficiencies in your process and strategy through forecasting, means you can effectively increase ROI. Every hour and resource allocated to a strategy that doesn’t deliver results can be reallocated to something that proves to be a more stable source of increased organic traffic. So finding out what strategies consistently deliver the results you expect, means you’re investing money into resources that have a higher probability of delivering you a larger ROI.

Furthermore, providing accurate projections, whether it’s to a CFO, manager, or client, gives the reviewer a more compelling reason to invest in the work that backs the forecast. Basically, if you want a bigger budget to work with, forecast the potential outcome of that bigger budget and sell it. Sell it well.

Okay. Flux Capacitor, Fluxing. Forecast, Forecasting?

Contraption that I have no clue what it does

I am going to make the assumption that everyone’s DeLorean is in the shop, so how do we forecast our organic traffic?

There are four main factors to account for in an organic traffic forecast: historical trends, growth, seasonality, and events. Historical data is always the best place to start and create your forecast. You will want to have as many historical data points as possible, but the accuracy of the data should come first.

Determining the Accuracy of the Data

Once you have your historical data set, start analyzing it for outliers. An outlier to a forecast is what Biff is to George McFly, something you need to punch in the face and then make wash your car 20 years in the future. Well something like that.

The quick way to find outliers is to simply graph your data and look for spikes in the graph. Each spike is associated with a data point, which is your outlier, whether it spikes up or down. This way does leave room for error, as the determination of outliers is based on your judgement and not statistical significance.

The long way is much more fun and requires a bit of math. I’ll provide some formula refreshers along the way.

Calculating the mean and the standard deviation of your historical data is the first step.

Mean

Formula for finding the mean

Standard Deviation

Standard Deviation Formula

Looking at the standard deviation can immediately tell you whether you have outliers or not. The standard deviation tells you how close your data falls near the average or mean, so the lower the standard deviation, the closer the data points are to each other.

You can go a step further and set a rule by calculating the coefficient of variation (COV). As a general rule, if your COV is less than 1, the variance in your data is low and there is a good probability that you don’t need to adjust any data points.

Coefficient of Variation (COV)

Coefficient of Variation Formula

If all the signs point to you having significant outliers, you will now need to determine which data points those are. A simple way to do this is calculate how many standard deviations away from the mean your data point is.

Unfortunately, there is no clear cut rule to qualify an outlier with deviations from the mean. This is due to the fact that every data set is distributed differently. However, I would suggest starting with any data point that is more than one deviation from the mean.

Making your decision about whether outliers exist takes time and practice. These general rules of thumb can help you figure it out, but it really relies on your ability to interpret the data and be able to understand how each data point affects your forecast. You have the inside knowledge about your website, your equations and graphs don’t. So put that to use and start making your adjustments to your data accordingly.

Adjusting Outliers

Ask yourself one question: Should we account for this spike? Having spikes or outliers is normal, whether you need to do anything about it is what you should be asking yourself now. You want to use that inside knowledge of yours to determine why the spike occurred, whether it will happen again, and ultimately whether it should accounted for in your future forecast.

Organic Search Traffic Graph

In the case that you don’t want to account for an outlier, you will need to accurately adjust it down or up to the number it would have been without the event that caused the anomaly.

For example, let’s say you launched a super original infographic about the Olympics in July last year that brought your site an additional 2,000 visits that month. You may not want to account for this as it will not be a recurring event or maybe it fails to bring qualified organic traffic to the site (if the infographic traffic doesn’t convert, then your revenue forecast will be inaccurate). So the resulting action would be to adjust the July data point down 2,000 visits.

On the flipside, what if your retail electronics website has a huge positive spike in November due to Black Friday? You should expect that rise in traffic to continue this November and account for it in your forecast. The resulting action here is to simply leave the outlier alone and let the forecast do it’s business (This is also an example of seasonality which I will talk about more later).

Base Forecast

When creating your forecast, you want to create a base for it before you start incorporating additional factors into it. The base forecast is usually a flat forecast or a line straight down the middle of your charted data. In terms of numbers, this can be simply be using the mean for every data point. The line down the middle of the data follows the trend of the graph, so this would be the equivalent of the average but accounting for slope too. Excel provides a formula which actually does this for you:

=FORECAST(x, known_y’s,known_x’s)

Given the historical data, excel will output a forecast based on that data and the slope from the starting point to end point. Dependent on your data, your base forecast could be where you stop, or where you begin developing an accurate forecast.

Now how do you improve your forecast? It’s a simple idea – account for anything and everything the data might not be able to account for. Now you don’t need to go overboard here. I would draw the line well before you start forecasting the decrease in productivity on Fridays due to beer o clock. I suggest accounting for three key factors and accounting for them well; growth, seasonality, and events.

Growth

You have to have growth. If you aren’t planning to grow anytime soon, then this is going to be a really depressing forecast. Including growth can be as simple as adding 5% month over month, due to a higher level estimate from management, or as detailed as estimating incremental search traffic by keyword from significant ranking increases. Either way, the important part is being able to back your estimates with good data and know where to look for it. With organic traffic, growth can come from a number of sources but these are a couple key components to consider:

Are you launching new products?

New products means new pages, and dependent on your domain’s authority and your internal linking structure, you can see an influx of organic traffic. If you have analyzed the performance of newly launched pages, you should be able to estimate on average what percentage of search traffic from relevant and target keywords they can bring over time.

Using Google Webmaster Tools CTR data and the Adwords Tool for search volume are your best bet to acquire the data you need to estimate this. You can then apply this estimate to search volumes for the keywords that are relevant to each new product page and determine the additional growth in organic traffic that new product lines will bring.

Tip: Make sure to consider your link building strategies when analyzing past product page data. If you built links to these pages over the analyzed time period, then you should plan on doing the same for the new product pages.

What ongoing SEO efforts are increasing?

Did you get a link building budget increase? Are you retargeting several key pages on your website? These things can easily be factored in, as long as you have consistent data to back it up. Consistency in strategy is truly an asset, especially in the SEO world. With the frequency of algorithm updates, people tend to shift strategies fairly quickly. However, if you are consistent, you can quantify the results of your strategy and use it improve your strategy and understand its effects on the applied domain.

The general idea here is that if you know historically the effect of certain actions on a domain, then you can predict how relative changes to the domain will affect the future (given there are no drastic algorithm updates).

Let’s take a simple example. Let’s say you build 10 links to a domain per month and the average Page Authority is 30 and Domain Authority is 50 for the targeted pages and domain when you started. Over time you see as a result, your organic traffic increase by 20% for the pages you targeted on this campaign. So if your budget increases and allows you to apply the same campaign to other pages on the website, you can estimate an increase in organic traffic of 20% to those pages.

This example assumes the new target pages have:

Target keywords with similar search volumes
Similar authority at prior to the campaign start
Similar existing traffic and ranking metrics
Similar competition

While this may be a lot to assume, this is for the purpose of the example. However, these are things that will need to be considered and these are the types of campaigns that should be invested in from a SEO standpoint. When you find a strategy that works, repeat it and control the factors as much as possible. This will provide for an outcome that is the least likely to diverge from expected results.

Seasonality

To incorporate seasonality into a organic traffic forecast, you will need to create seasonal indices for each month of the year. A seasonal index is an index of how that month’s expected value relates to the average expected value. So in this case, it would be how each month’s organic traffic compares with average or mean monthly organic traffic.

So let’s say your average organic traffic is 100,000 visitors per month and your adjusted traffic for last November was 150,000 visitors, then your index for November is 1.5. In your forecast you simply multiply by this weight for the corresponding index month.

To calculate these seasonal indices, you need data of course. Using adjusted historical data is the best solution, if you know that it reflects the seasonality of the website’s traffic well.

Remember all that seasonal search volume data the Adwords tool provides? That can actually be put to practical use! So if you haven’t already, you should probably get with the times and download the Adwords API excel plugin from SEOgadget (if you have API access). This can make gathering seasonal data for a large set of keywords quick and easy.

What you can do here, is gather data for all the keywords that drive your organic traffic, aggregate it, and see if the trends in search align with the seasonality you are observing in your adjusted historical data. If there is a major discrepancy between the two, you may need to dig deeper into why or shy away from accounting for it in your forecast.

Events

This one should be straightforward. If you have big events coming up, find a way to estimate their impact on your organic traffic. Events can be anything from a yearly sale, to a big piece of content being pushed out, or a planned feature on a big media site.

All you have to do here is determine the expected increase in traffic from each event you have planned. This all goes back to digging into your historical data. What typically happens when you have a sale? What’s the change in traffic when you launch a huge content piece? If you can get an estimate of this, just add it to the corresponding month when the event will take place.

Once you have this covered, you should have the last piece to a good looking forecast. Now it’s time to put it to the test.

Forecast Accuracy

So you have looked into your crystal ball and finally made your predictions, but what do you do now? Well the process of forecasting is a cycle and you now need to measure the accuracy of your predictions. Once you have the actuals to compare to your forecast, you can measure your forecast accuracy and use this to determine whether your current forecasting model is working.

There is a basic formula you can use to compare your forecast to your actual results, which is the mean absolute percent error (MAPE):

MAPE formula

This formula requires you to calculate the mean of the absolute percent error for each time period, giving you your forecast accuracy for the total given forecast period.

Additionally, you will want to analyze your forecast accuracy for just a single period if your forecast accuracy is low. Looking at the percent error month to month will allow you to pin point where the largest error in your forecast is and help you determine the root of the problem.

Keep in mind that accuracy is crucial if organic traffic is a powerful source of product revenue for your business. This is where exceeding expectations can be a bad thing. If you exceed forecast, this can result in stock outs on products and a loss in potential revenue.

Consider the typical online consumer, do you think they will wait to purchase your product on your site if they can find it somewhere else? Online shoppers want immediate results, so making sure you can fulfil their order makes for better customer service and less bounces on product pages (which can affect rank as we know).

Google Results for Vizio 19in

Walmart Vizio TV

Top result for this query is out of stock, which will not help maintain that position in the long term.

Now this doesn’t mean you should over forecast. There is a price to pay on both ends of the spectrum. Inflating your forecast means you could be bringing in excess inventory as it ties to product expectations. This can bring in unnecessary inventory expenses such as increased storage costs and tie up cash flow until the excess product is shipped. And dependent on product life cycles, continuing this practice can lead to an abundance of obsolete product and huge financial problems.

So once you have measured your forecast to actuals and considered the above, you can repeat the process more accurately and refine your forecast! Well this concludes our crash course in forecasting and how to apply it to organic traffic. So what are you waiting for? Start forecasting!

Oh and here is a little treat to get you started.

Are you telling me you built a time machine…in Excel?

Well no, Excel can’t help you time travel, but it can help you forecast. The way I see it, if you’re gonna build a forecast in Excel, why not do it in style?

I decided that your brain has probably gone to mush by now, so I am going to help you on your way to forecasting until the end of days. I am providing a stylish little excel template that has several features, but I warn you it doesn’t do all the work.

It’s nothing to spectacular, but this template will put you on your way to analyzing your historical data and building your forecast. Forecasting isn’t an exact science, so naturally you need to do some work and make the call on what needs to be added or subtracted to the data.

What this excel template provides:

The ability to plug in the last two years of monthly organic traffic data and see a number of statistical calculations that will allow you to quickly analyze your historical data.
Provides you with the frequency distribution of your data.
Highlights the data points that are more than a standard deviation from the mean.
Provides you with some metrics we discussed (mean, growth rate, standard deviation, etc).

Oh wait there’s more?

Yes. Yes. Yes. This simple tool will graph your historical and forecast data, provide you with a base forecast, and a place to easily add anything you need to account for in the forecast. Lastly, for those who don’t have revenue data tied to Analytics, it provides you with a place to add your AOV and Average Conversion Rate to estimate future organic revenue as well. Now go have some fun with it.

________________________________________________________________________________________

Obviously we can’t cover everything you need to know about forecasting in a single blog post. That goes both from a strategic and mathematical standpoint. So let me know what you think, what I missed, or if there are any points or tools that you think are applicable for the typical marketer to add to their skillset and spend some time learning.