There’s been some discussion on Hacker News about the use of Google Analytics, specifically its legality in the EU and the privacy concerns around it. Companies must stop using Google Analytics, according to the Swedish Authority for Privacy Protection (IMY).

I remember before we had Google Analytics. The discussion back then was between Webalizer and AWStats, and I’m surprised to see that AWStats has a beta version from January this year, so clearly it’s still being actively developed. Webalizer’s last version is 10 years old, but maybe that’s still fine. The log format from Apache hasn’t changed, so if the input is the same, perhaps the program isn’t unmaintained so much as complete. But then what is AWStats up to? Perhaps GeoIP stuff. I can’t be bothered to check right now.

I’ve had a look for more up-to-date programs and found GoAccess.io, which, with a name like that, I would have expected to be a program written in Go, but it’s written in C.

I knocked up a script to go and fetch some log files, and then use goaccess to create some output for each of my domains.

#!/bin/bash
server='your-server-name'
Sites=('www.mysite1.com' 'www.mysite2.com')
date=$(date +%F)   # date stamp used in the output file names

# First run: pull down the whole apache2 log directory.
# Later runs: only fetch the live logs and the most recently rotated ones.
if [ ! -d "$server-logs" ]; then
	mkdir "$server-logs/"
	rsync -uarv "$server:/var/log/apache2" "$server-logs/"
else
	rsync -uarv "$server:/var/log/apache2/*.log" "$server-logs/apache2/"
	rsync -uarv "$server:/var/log/apache2/*.1" "$server-logs/apache2/"
fi

gunzip -f "$server-logs"/apache2/*.gz
mkdir -p html json   # goaccess won't create the output directories itself

for domain in "${Sites[@]}"; do
	# each domain needs its own db-path, otherwise the persisted stats get merged
	mkdir -p "db_cache/$server/$domain/"
	goaccess \
		"$server-logs"/apache2/"$domain".access.log* \
		-o "html/$domain.$date.html" \
		-o "json/$domain.$date.json" \
		--persist \
		--restore \
		--db-path="db_cache/$server/$domain" \
		--html-prefs='{"theme":"bright","perPage":20,"layout":"vertical","showTables":true,"visitors":{"plot":{"chartType":"bar"}}}' \
		--html-report-title="$domain" \
		--log-format=COMBINED
done

This script, run at least once a day (though it could be many times a day), will gather the logs from Apache and produce an HTML and a JSON file of stats for each domain. I found a few gotchas: the --db-path needs to be specified to keep the data separate for each domain. Really, I think the --restore and --persist options should not function without a --db-path specified.
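For what it’s worth, running it daily is just a cron entry. The path here is a placeholder, not where my script actually lives:

# Run the log fetch + GoAccess report every morning at 06:15.
15 6 * * * /path/to/fetch-stats.sh >/dev/null 2>&1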

So I’ve run it a few times, and I’m noticing some really interesting things on the sites I look after.

Differences between GoAccess and Google Analytics

Loads. Loads and loads. They are completely different and answer completely different questions.

Questions I care about, which are answered well by Google Analytics

  • Conversions
  • Flow through the site
  • Entry and exit pages
  • A floor level of how many users visited (see below for more)
  • What’s happening right now

Questions that I have to go to log based analytics to see

  • Server errors
  • Static asset hits & related bandwidth
  • Bandwidth usage
  • Crawler usage

The thorny issue of “Visits” and “Users”

Both Google Analytics and log-based analytics fail on “Users”, and as a result, “Visits”. I think Google Analytics is far better, but because it relies upon JavaScript working and a separate network request, it could be missing some data, and we wouldn’t know how much.

The problem with log-based analytics, on the other hand, and this was true for Webalizer and AWStats back in the day and still seems to be true for GoAccess now, is that the users and visitors figures are far, far too high.

What they do, and indeed probably all they can do given the content of the logs, is come up with their own way of deciding what a user is. GoAccess defines it as:

Hits having the same ip, date and agent are a unique visit.

Which means that if multiple people come from the same IP with the same agent on the same date, they only show up as one visitor, and similarly, if one person comes from many IP addresses they show up as many visitors.
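To make that concrete, the same kind of count can be roughly reproduced from a combined-format log with awk. This is only a sketch of the counting rule, not what GoAccess actually runs internally, and access.log is a placeholder:

# Count unique (IP, date, user-agent) combinations in a combined-format log.
awk '{
	ip = $1
	date = substr($4, 2, 11)      # "[10/Oct/2023:13:55:36" -> "10/Oct/2023"
	n = split($0, q, /"/)         # the user agent is the last quoted field
	print ip "\t" date "\t" q[n-1]
}' access.log | sort -u | wc -l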

Given how mobile phones are constantly changing IP address, the visitor count in the log-based analytics ends up very far away from Google Analytics. On one site I have, Google Analytics reports 29 users in a day, and goaccess is reporting 722 for the same day.

I know Google Analytics is a floor value and the log-based figure is a ceiling value. If I tell goaccess to ignore crawlers I get it down to 520 users, which is still a massive difference.
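For reference, that’s just one extra flag on the goaccess call. A minimal example with placeholder file names:

goaccess access.log --log-format=COMBINED --ignore-crawlers -o report.html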

My gut feeling is that Google Analytics is by far the better one at answering the question. I’m not sure though even how to test that.

But it’s still approximately 1/20th of what the server is reporting. Which is bonkers. What’s going on? Here are a few ideas:

  • Crawlers are coming in from different IPs all the time, but GoAccess doesn’t see them as crawlers
  • Where we use lazy loading, a scroll down the page could trigger more hits, and if that happened the next day it would count as another visitor
  • Google Analytics doesn’t think of a visit as being a “new visit” just because the date changed. I don’t know if this is the case.
  • Mobile phones on carrier-grade NAT are constantly changing their IP address, a lot more than I would expect

Looking at GoAccess I see they have a chart that shows “visitors per IP” and there’s a range of these that are quite high. 31.14.26.72 has 71 visitors, that’s over 2 weeks. 110.239.215.165 has 70 visitors.

Thanks to lnav I’ve been able to see quite quickly that, actually, those 71 visitors over the last 2 weeks are all crawlers pretending to be normal users.
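Even without lnav, a quick grep over the raw log shows what one suspect IP has been requesting and which user agents it claimed; the log file name here is a placeholder:

# Requests and claimed user agents for a single IP, most frequent first.
grep '^31\.14\.26\.72 ' access.log \
	| awk -F'"' '{print $2 " | " $6}' \
	| sort | uniq -c | sort -rn | head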

For the ideal answer to the question “How many users were there?” I guess you would need some qualifying evidence of realness. Ideas are:

  • A static asset (like a logo) is requested alongside a page (there’s a rough sketch of this after the list). This could go wrong with caching issues, especially if you put a CDN in front of the site.
  • You see the user come back with a specific cookie, like a PHP session cookie
  • The timing pattern of requests looks random enough.
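As a rough sketch of that first idea, and it is only a sketch: count an (IP, date, agent) combination as a visitor only if it requested both a page and a known static asset. The asset path and the log file name are assumptions:

# Only count (IP, date, agent) tuples that fetched a page AND the site logo.
# /images/logo.png and access.log are placeholders.
awk -v asset='/images/logo.png' '{
	ip = $1
	date = substr($4, 2, 11)
	n = split($0, q, /"/)          # request line in q[2], user agent in q[n-1]
	split(q[2], req, " ")          # req[2] is the requested path
	key = ip "\t" date "\t" q[n-1]
	if (req[2] == asset) saw_asset[key] = 1
	else saw_page[key] = 1
}
END {
	for (k in saw_page) if (k in saw_asset) count++
	print count+0, "qualified visitors"
}' access.log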

I guess this gets into the weeds of stats, but it feels really important, since it makes such a big difference to the figures, which, if they matter at all, should be right.