Unbiased Web Stats For Germany

We know these are hard to come by, but there is hope. In particular regions there are sites:

  • of particular interest to the regions,
  • with subject matter rather remote from operating systems and software,
  • with a good volume, and
  • with a log analyzer like AWstats.

In the hope of finding such sites, I installed AWstats and examined the output for strings I could search on. I then used Google with the query "Reported period" "Month" "linux" site:.de to zero in on AWstats output pages.
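
For anyone who wants to repeat the hunt, here is a minimal sketch in Python. The candidate URL and the regular expression for AWstats’ “Operating Systems” table are assumptions; real AWstats markup varies by version and skin.

    import re
    import urllib.request

    # Candidate pages turned up by the Google query above.
    candidates = [
        "http://example.de/awstats/awstats.pl?output=os",  # hypothetical URL
    ]

    # Rough pattern for rows of AWstats' "Operating Systems" table;
    # an assumption, since the exact markup differs between versions.
    os_row = re.compile(
        r"(Windows|Linux|Macintosh)[^<]*</td><td[^>]*>([\d,.]+)",
        re.IGNORECASE)

    for url in candidates:
        try:
            page = urllib.request.urlopen(url, timeout=10)
            html = page.read().decode("utf-8", "replace")
        except OSError as err:
            print(url, "failed:", err)
            continue
        for name, hits in os_row.findall(html):
            print(url, name, hits)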

I found some crazy stuff, like misconfigured web servers actually delivering the counter script itself instead of HTML…

Here’s a promising site. It’s in German but it seems to be about sports and has good volume: 64K unique visitors, thousands of visits per day and hundreds of thousands of hits per day. According to Netcraft it has run on GNU/Linux since 2004, with SUSE until 2011 and CentOS after that. The result?

  • 81% That Other OS,
  • 8.1% */Linux, and
  • 7.6% MacOS

The bulk of the “Linux” hits are classified as unknown distros, so some could be smart thingies. The sample could be biased toward young males in Germany, but at least it’s not ~1%. A few hosts are consistently in the top 10 of visitors, but we don’t know what OS they run.

I found an ArchLinux repository. Of course it showed 94% GNU/Linux, but it was interesting to see that it got hits from all over the world, with about 1,000 unique visitors.

The Transcarpathia Benefit Society had AWstats but just a few hundred unique visitors and 3.5% */Linux.

A student newsletter had ~1000 unique visitors each month and 18% */Linux.

I think it is safe to say that way more than 1% of real people use GNU/Linux in Germany.

About Robert Pogson

I am a retired teacher in Canada. For almost forty years I taught in the subject areas where I have worked: maths, physics, chemistry and computers. I love hunting, fishing, and picking berries and mushrooms, too.

24 Responses to Unbiased Web Stats For Germany

  1. John Cockroft wrote, “is being activated on 900,000 NEW devices a day”.

    That would be wonderful if true, but Google has just announced that they are up to 1.3 million activations per day, which is fabulous, or something.

    Your points are otherwise valid. 😉

  2. John Cockroft says:

    Well, well, well! The same tired Linux haters putting out the same tired nonsense!

    If Linux is so insignificant then why do you even bother to comment on blogs like these? The answer is (of course) that it isn’t insignificant at all. Android in particular (which is a Linux distribution) is being activated on 900,000 NEW devices a day, and 25 BILLION applications have been downloaded from Google Play! Android is more popular than iOS (Apple iPhone) worldwide. My Samsung Galaxy S3 runs kernel 3.0.15 and the CyanogenMod 10 distribution (with the latest security patches). Yup, most definitely Linux!

    There are two million pre-orders for the Raspberry Pi computer (which runs Linux), and from this month on, 30,000 units a month should be shipped. There are at least 20 million DESKTOP users of Ubuntu worldwide, and that does not count cloud VM usage. Most VMs deployed are Linux.

    The reason that fewer desktop PCs run Linux (and they are NOT ‘Windows’ PCs; they are generic PCs which happen to run Windows) is that Microsoft forces pre-loading of Windows on PCs ‘to prevent piracy’. Yeah, right! To prevent competition (and protect the existing monopoly), more like. People do not have a choice in computer shops. They are FORCED to buy proprietary operating systems (either Windows, or a much poorer second, OS X, on Apple appliances, which include Macs) and are not given any other choices. Most people have heard of Android but most have not even heard of Ubuntu, Mint, Fedora, openSUSE and other desktop Linux distributions. If they had, and knew they could have thousands of applications for free, a fast, easy-to-use, virus-free desktop and their favourite browsers like Chrome and Firefox, and MOST IMPORTANTLY if there were sample laptops in shops nicely configured and running these operating systems, then I suspect it would be a very different story. I also suspect that the trolls who post here are scared of that happening, hence the continuous put-downs.

  3. oiaohm says:

    Mongrol and Robert Pogson.

    The possible error is so high because none of the web stats do site demographics or anything else to make sure they are truly collecting from a broad cross-section of the population.

    –The thing that kills many web stats is not statistical variation but huge biases like counting m$-only sites or clients from business domains or you name it.–

    What this defines is a lack of demographics to reduce error. For the numbers to be trusted, the demographics of collection have to be reported and checkable. From that you could work out whether a demographic was missed and possibly how large it was.

    Mongrol
    –You denigrate any statistics that do not match your worldview as incorrect–
    The best world-view statistics are based on census documents, mobile-phone carriers and other sources that cover huge percentages of the population with a very low error value. The problem is we don’t have that for desktop OS usage. I don’t denigrate all statistics, only the ones with method errors.

    Next best is demographically neutral profiling to locate collection points that are not biased against any demographic: sites that everyone would plausibly visit. You want to pick a neutral list of sites that gives fairly good numbers and is pretty much demographically neutral without monitoring them all. The best answer is most likely search engines. The problem is that search engines don’t publish their visitor numbers.

    Another number not collected is how many ISP accounts there are globally. That would tell you when you had possibly seen everyone.

    An ISP-run survey could give some very accurate data.

    Normally distributed error values only hold if your collection design is not majorly flawed demographically.

    The reality is that the numbers MS trolls use come without published collection demographics, so the bias could be the worst possible case.

    The one site you have covers a very small demographic, so it does not rate either.

    Collecting usable statistics is not simple.

    Basically, doing what current web stats do, setting up collection from whoever will give them data, results in pure garbage, because the chosen sites can cause one or more demographics to be over-represented.

    In fact the Do Not Track feature that browsers are now including will make web stats even less dependable.

    This is why Apache httpd disregards IE’s Do Not Track header when it is enabled by default. I really wish a court would say Apache is not allowed to do this. The result would be all IE users no longer being trackable, screwing up web stats even more.

    Tracking users is required so you don’t count them many times. Cookie clearing by some anti-virus software causes Windows users to be over-counted as well.

    Welcome to the hell of trying to get sanity from web stats: error after error. Even on the single site you picked, the MS users might be fewer than displayed due to the interaction of anti-virus software with the web browser.

    1) IP addresses are not constant and don’t tell you how many machines sit behind them, so one IP-based count might be hundreds or even millions of machines.
    2) Tracking information used to see past the IP problem is not guaranteed to survive; it might be removed by anti-virus software or be forbidden to place. So this method has errors too.
    3) Demographic issues in the sites web stats get their information from, since most web stats are based on sites with advertising.
    4) Lack of volume to cancel out the errors.

    Worse, the biggest advertiser, Google, does not release its information to anyone, so everyone else is working from small subsets.

    Basically, if Google wanted to, it could answer the question better than everyone else publishing web stats.

    These are very particular errors that affect web stats and make the results nothing more than garbage.

    Mongrol, can you address these four web-stats defects? If you cannot, you really should not be using the stats.

    When you know the defects, you know the sooner people stop using them the better; then we might have the pressure to use things like ISP surveys, or to push search providers to publish their numbers, or other means of finding out how many Windows/Linux users there are that are not as defective.

    –Think a large number of sites causes reliable sampling? Consider counting PCs only from a business domain during office hours. Google was counted because they are from a business domain. Since most businesses use that other OS, that’s a huge bias. How else do you explain that Google has such huge pull on NetApps web stats? Perhaps Google is counted correctly but all those other users of GNU/Linux are not.–

    Google has a huge pool of direct IP addresses its staff uses. Google will not count with the same error as a normal business.

    NetApps’ web stats shift when Google moves because their collection is so small and the possible error so high, so Google alone can send a shock-wave through the numbers. A large enough collection, or collection from unbiased sites, should not show shock-waves.

    The brute-force method requires a lot more effort than brains, and NetApps’ collection method is brute force. To get somewhere near dependable with brute force you want between 10 and 20 percent of the possible population in the numbers. With less than 1.5 percent you need very careful selection of collection points or your numbers are most likely bogus.

  4. oiaohm wrote, “The possible error for 3 million sites is +/- 98.4 percent.”

    How do you figure that? A large error is always possible but unlikely. That is the nature of normally distributed events. A normal distribution is what results from the combination of a large number of additive effects. The most probable outcome is the average value, or arithmetic mean, and the tails a few standard deviations away from that mean are very small. The thing that kills many web stats is not statistical variation but huge biases like counting m$-only sites or clients from business domains or you name it… I tried to show that stats from a site with known biases, probably toward young German men, can show more than ~1% for GNU/Linux. Counting 3 million biased sites would not be more reliable than that. The number of counted sites is not an indication of correct sampling of the whole population of PCs either, but at least with one site, we know what is counted.
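
    To put a number on that: under simple binomial sampling (an idealization that treats each unique visitor as an independent draw and says nothing about bias), the purely statistical error on a measured share is easy to compute. A quick sketch using the sports site’s figures from the post:

        # Sampling error of a measured OS share under binomial sampling.
        # An idealization: this bounds statistical noise only, not bias.
        from math import sqrt

        p = 0.081   # measured */Linux share from the site above
        n = 64_000  # unique visitors, treated as independent samples

        se = sqrt(p * (1 - p) / n)  # standard error of a proportion
        print(f"{p:.1%} +/- {1.96 * se:.2%} at 95% confidence")
        # -> about +/- 0.21 percentage points: sampling noise is tiny,
        #    so any large error must come from bias, not variance.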

    Think a large number of sites causes reliable sampling? Consider counting PCs only from a business domain during office hours. Google was counted because they are from a business domain. Since most businesses use that other OS, that’s a huge bias. How else do you explain that Google has such huge pull on NetApps web stats? Perhaps Google is counted correctly but all those other users of GNU/Linux are not.

  5. Mongrol wrote, ““GNU/Android” – No such thing, Robert.”

    I know that. I meant it to be read as “GNU or Android”.

  6. Mongrol says:

    I wasn’t being sarcastic.

    It’s pretty much a statement of fact. You denigrate any statistics that do not match your worldview as incorrect, biased, or even bought and paid-for corrupt. Yet statistics that reinforce your position are seen as reliable. Even if they’re from the same source.

    “GNU/Android” – No such thing, Robert. Android contains “very little”[1] GNU code, and barring the kernel, Android is licensed under the Apache license, not the GPL.

    [1] http://www.gnu.org/philosophy/android-and-users-freedom.html

  7. oiaohm says:

    The possible error for 3 million sites is +/- 98.4 percent. That makes the collected numbers basically worthless. The possible error value is based on the percentage you did not count.

    One site is a scary +/- 99.9999995 percent, with only one rule: the stats cannot go negative.

    So by that error factor MS could have just 1 percent market share. Of course I don’t suspect that is true, but it shows how bad the quality of the web-stat data is.

  8. oiaohm says:

    Robert Pogson: tipping points are not universal. Some areas have tipped, that is correct. Some areas are still tipping. Some areas are resistant at this stage. Without good stats, working out which areas need work is going to be hard.

    Mongrol: trying to get figures that look sane out of bad collection is downright impossible. Basically Robert has been trying to reconcile the conflicting data instead of facing the simple reality that the web stats are stuffed. They are not collected properly to give useful numbers, and trying to create a correction formula is impossible; there are too many variables, like what percentage of the population would never visit a given site.

    The old rule: garbage in, garbage out. Web stats on desktop usage are garbage and there is no way to fix the errors in them. This is why some sites show around 1 percent, others around 10, and others around 50 percent. Even 3 million sites does not help you, because it is not enough sites.

    The two useful stats would be the number of full-time developers working on Windows-based software and the number of full-time developers working on Linux-based source. That is another way of detecting the tipping point.

    Mongrol: one of the ways to attempt to remove statistical error is merging the numbers. This only works if you are sure the numbers do not overlap.

    Merging the results of two different experiments on the same thing works as long as they share no data. The problem with web stats like NetApplications’ is that you don’t have the site list, so you cannot know which sites’ stats you may merge in and which were already counted. So you cannot correct any bias caused by site selection, or even see whether there is a site-selection bias problem.

    Web stats basically say: trust us, we know what we are doing. That is not science. They are voodoo magic numbers.

    –web stats ranging from 1% to 20% depending on source and region.–
    This is huge volatility. We know mobile-phone stats do not show this volatility, and the same users operate both, so you would not expect much difference. This is a symptom of the problem. Stats with volatility this high are normally failing to collect correct data; there are stuff-ups somewhere.

    Mongrol: the reality is MS trolls like web stats and never check the quality of their collection. Linux people end up staring at the stats trying to resolve the insanity of the mess, not waking up to the fact that they are wasting their time, because the web stats are statistical garbage. That is why they show too much volatility.

    “Then there was the man who drowned crossing a stream with an average depth of six inches.”
    W. I. E. Gates

    That is an important quote. All the site-to-site averaging can produce a figure that is way off.

    Precision calculations are shockingly bad for web stats.

    Basically, the number of sites you collect visitors from, divided by the number of active sites in existence, equals your percentage counted.

    So one site is 1/190,968,692. OK, that is bugger all.

    3,000,000 sites, OK, that is a lot. Except that it is only about 1.6 percent of possible sites, and nothing says everyone will visit them.

    So what about the users of the other 187,968,692 sites who did not visit the 3,000,000? Are you sure you know what they are doing?

    Remember you have to multiply the web stats by the correction figure.

    0.015709381 is the figure for stats with 3,000,000 collection points.

    So their total numbers can only be assumed to represent 1.6 percent of total Internet users. So where are the other 98.4 percent? You are presuming they are the same. No evidence says they are.

    The maths to work out the possible error value is not hard; the sample is simply way too small to provide dependable numbers. The same issue hits people polling on political parties to find out how a vote will go. Those polls are done very carefully to avoid bias, but since only a small group of voters is asked, come election time the numbers can be not even close.
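
    To make the arithmetic explicit, here is a minimal sketch (the 190,968,692 active-site count is the figure used above):

        # Coverage of a web-stats collection: sites counted divided by
        # all active sites in existence (figure used in this comment).
        ACTIVE_SITES = 190_968_692

        for counted in (1, 3_000_000):
            coverage = counted / ACTIVE_SITES
            print(f"{counted:>9,} sites: coverage {coverage:.9f} ({coverage:.1%})")
        # -> one site covers next to nothing; 3,000,000 sites cover
        #    only about 1.6 percent of all active sites.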

  9. Mongrol wrote, sarcastically, “NetApplications are biased and unreliable when they show a number you don’t like, but rock-solid when niches, freak results, or cherry-picked examples back you up.”

    Of course not. NetApplications has a clear bias to business use, one that cannot be ignored when looking at Sunnyvale, California, but they likely do not have a bias between Android and GNU/Linux. The fact that NetApplications disagrees violently with other sources about the share of GNU/Linux but not about GNU/Android supports that.

  10. Mongrol says:

    “We see that in the Wikimedia stats but NetApplications shows, for Germany, Android=1.95% and GNU/Linux=1.62%, just slightly more than half. The same ratio for 8.1% */Linux would be 1.95/(1.95+1.62)×8.1=4.42% Android and 1.62/(1.95+1.62)×8.1=3.67% GNU. Still a long way past ~1%.”

    Typical. NetApplications are biased and unreliable when they show a number you don’t like, but rock-solid when niches, freak results, or cherry-picked examples back you up.

  11. oiaohm wrote, “Basically, when it comes to statistics that would be useful for working out how far from the tipping point we are, we don’t have them.”

    The tipping point was long ago, probably around Vista. We now have whole governments migrating to GNU/Linux: Brazil, Russia, India, China, Malaysia, and a bunch more preferring FLOSS. That’s tipped.

    Governments are often the largest business in any country and they set trends that suppliers and other businesses notice. Even the USA and Canada have begun to formalize policies for using FLOSS. Once open standards and FLOSS apps are adopted, there’s nothing preventing adoption of GNU/Linux on a cost/maintenance basis.

  12. oiaohm says:

    Robert Pogson
    –I have for several years recorded all kinds of indications that GNU/Linux is thriving and growing rapidly. What more do you need to show that an OS is alive than:–

    Robert Pogson: post number 11 is more to the point. We know web stats, no matter the source, are flawed, so there is no point referencing them. Better to start collating the more exact information where we can.

    –space on retail shelves in many parts of the world–
    We don’t have maps of where this is or what percentage of retail space Linux has managed to get. Yes, stats on this are lacking.

    –OEMs making GNU/Linux boxes and notebooks,–
    A good sign, but these OEMs also keep how many Linux units they ship very close to the chest.

    –millions of developers cranking out the OS and apps for it,–
    This we do have fairly good numbers for, down to what percentage are full-time paid: between 70 and 80 percent.

    –huge roll-outs in Russia, Brazil, India and China–
    This has not been collated into how many desktops it covers and how many developers each migration brings.

    Basically, when it comes to statistics that would be useful for working out how far from the tipping point we are, we don’t have them.

  13. oiaohm says:

    Yep, that is the thing: when you start comparing the numbers between different web sources they end up not adding up. They completely conflict with each other at times.

    This is the problem; at one time I did seriously go looking for some real numbers.

    For mobile phones you can get real numbers from the carriers. Each phone has a unique ID number that tells you make and model, and the carrier sees it, so they can tell you exactly how many phones of a given type were connected to their network in the last 30 days.
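
    As a sketch of how that lookup works: the first eight digits of an IMEI are the Type Allocation Code (TAC), which identifies make and model. The table entries below are made-up placeholders, not real TAC data.

        # Map a device ID (IMEI) to make/model via its Type Allocation
        # Code (TAC), the first 8 digits. Entries here are hypothetical.
        TAC_TABLE = {
            "35316605": "Samsung Galaxy S III",  # hypothetical TAC
            "01215800": "Apple iPhone 4S",       # hypothetical TAC
        }

        def model_from_imei(imei: str) -> str:
            return TAC_TABLE.get(imei[:8], "unknown model")

        # A carrier tallying models seen on its network in the last 30 days:
        seen = ["353166051234567", "012158001234567", "999999990000000"]
        counts = {}
        for imei in seen:
            model = model_from_imei(imei)
            counts[model] = counts.get(model, 0) + 1
        print(counts)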

    For supercomputers you get a good count because most supercomputer locations are known.

    For web servers, a good approximation. We don’t know exactly how many servers sit behind each site, but we are fairly close. Yes, there is error, but it is allowable because the count still tells you what you need to know, i.e. how many sites run on what, which is what matters to developers. Not exactly perfect, because some sites could have one server and some could have thousands.

    For the desktop we are just screwed. Every measure is guesswork, and combining more of them just makes the guesswork worse.

  14. oiaohm contradicts himself by writing, “Robert Pogson does not have any that is fully alive.” and “There is enough recorded migrations to Linux desktops to say it’s alive; the question is how far.”

    I have for several years recorded all kinds of indications that GNU/Linux is thriving and growing rapidly. What more do you need to show that an OS is alive than:

    • space on retail shelves in many parts of the world,
    • OEMs making GNU/Linux boxes and notebooks,
    • millions of developers cranking out the OS and apps for it,
    • huge roll-outs in Russia, Brazil, India and China, and
    • web stats ranging from 1% to 20% depending on source and region.