I remember reading about Benford's law years ago with fascination and thought I'd share it. Such a fun use of maths in the real world. Here's one application:
Dr. Theodore P. Hill asks his mathematics students at the Georgia Institute of Technology to go home and either flip a coin 200 times and record the results, or merely pretend to flip a coin and fake 200 results. The following day he runs his eye over the homework data, and to the students' amazement, he easily fingers nearly all those who faked their tosses.
Smart, eh? It's all because people don't know enough about how numbers really work and so can't fake data convincingly. The first thing that people get wrong when faking data is assuming that each digit from 1 to 9 has an equal chance of being a number's first digit. It does not. In the real world, numbers are much more likely to start with a '1' than any other digit.
From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998
Benford's law predicts how often each digit appears as the first digit of a number. As you can see from the above, it matches real-world data sets closely. It predicts that '1' is the most likely first digit, then '2' less so, and so on down to '9'.
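For reference, the law has a simple closed form: the probability that a number's first significant digit is d is log10(1 + 1/d). Here's a minimal Python sketch of that formula (the function name `benford_expected` is my own invention, not something from Hill's article):

```python
import math

def benford_expected():
    """Return {digit: probability} for first digits 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

if __name__ == "__main__":
    # Print the expected share of numbers starting with each digit.
    for digit, p in benford_expected().items():
        print(f"{digit}: {p:.1%}")
```

So roughly 30% of numbers should start with a '1', and fewer than 5% with a '9'.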
When you see the analysis of fraudulent data sets, it really comes to life:
From "The First-Digit Phenomenon" by T. P. Hill, American Scientist, July-August 1998
From the same article:
Benford's law can be used to test for fraudulent or random-guess data in income tax returns and other financial reports. Here the first significant digits of true tax data taken by Mark Nigrini from the lines of 169,662 IRS model files follow Benford's law closely. Fraudulent data taken from a 1995 King’s County, New York, District Attorney's Office study of cash disbursement and payroll in business do not follow Benford's law. Likewise, data taken from the author's study of 743 freshmen's responses to a request to write down a six-digit number at random do not follow the law. Although these are very specific examples, in general, fraudulent or concocted data appear to have far fewer numbers starting with 1 and many more starting with 6 than do true data.
Back in 2005 I worked with a team of organisers who would report the number of doors that each of their teams had knocked on each day. I thought it would be fun to see how that data compared to Benford's law. Overall, you see that it doesn't look like people were being honest:
But it's not all bad news. It looks like some were more honest than others:
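If you want to try this kind of comparison on your own counts, here's a rough Python sketch. It is not the exact analysis I ran back then. The sample numbers are invented purely for illustration, the function names are my own, and the chi-square comparison is one reasonable choice of test rather than the only one. Benford's law also tends to fit best when the data spans several orders of magnitude, and a real check needs far more values than this toy list.

```python
import math
from collections import Counter

# 5% critical value of the chi-square distribution with 8 degrees of freedom
# (nine first-digit categories minus one).
CHI2_CRITICAL_DF8 = 15.507

def first_digit(n):
    """First significant digit of a non-zero integer."""
    return int(str(abs(int(n)))[0])

def benford_chi_square(values):
    """Chi-square statistic comparing first digits of `values` to Benford's law."""
    digits = [first_digit(v) for v in values if int(v) != 0]
    counts = Counter(digits)
    total = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Hypothetical daily door-knock counts, made up for this example.
sample = [112, 97, 143, 68, 205, 131, 18, 164, 29, 110, 75, 310, 122, 41, 187]
stat = benford_chi_square(sample)
print(f"chi-square = {stat:.2f}, suspicious = {stat > CHI2_CRITICAL_DF8}")
```

A statistic above the critical value means the first digits differ from Benford's pattern by more than chance alone would usually explain.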

  1. So, help me understand how this would apply to fake coin flipping data?

  2. First, I'm certain it's not about the total number of heads or tails. Comparing that across the class would allow you to estimate the proportion of the class that cheated, but not to identify individuals.

    I suspect therefore it's about the sequence itself. Probably the lengths of the consecutive runs of H or T.

    If it were me I would convert the sequence into numbers and then look at the distribution of these numbers. I would do this by listing the length of each 'run' of heads or tails in order.

    E.g. HHTHTTHHTHTHHHT would become 2,1,1,2,2,1,1,1,3,1

    Let's have a stab at the maths:

    Once a run has started, the probability of the next flip being different, and so ending the 'run', is 50%. So we should see '1' occur about 50% of the time in our list. (In my made-up list above, it is 6/10 - not too bad!)

    The probability of the second flip being the same is also 50%. And the probability of this 'run' ending on the next flip is 50%. This means the total probability of getting a run of length two is 25%. (In my made-up flips above, I had '2' 3/10 times, or 30% of the time. A little high!)

    Similarly, the probability of getting a run of length n is (50%)^n.

    All the lecturer needs to do is convert each student's list of coin flips into the number sequence above and do a statistical test to check whether the difference from the expected pattern is less than 5% likely to be down to chance (or similar). 200 coin flips should give about 100 numbers once converted, which seems to be a decent sample size.

    Anyone want to chip in on what the statistical test should be?

    By the way, here's another (probable) application of Benford's law to coin-flipping in the real world: http://paul.kedrosky.com/archives/2008/07/21/hedge_fund_test.html

    (From a blog I highly recommend, by the way)

  3. Very cool! Who is organizer 4? Do I have to search around through my decommissioned computers for zone 4 to find the answer to this?

  4. That would be very cruel of me. The organizer numbers have been changed. I don't think we should be surprised by the findings though, should we :)
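
The run-length idea from comment 2 above can be sketched in Python. This is only a guess at one workable approach, not necessarily how Dr. Hill actually spots the fakers. The function names and the choice to pool runs of five or more into one bucket are my own assumptions, and the chi-square comparison is approximate because the runs in a fixed 200 flips are not perfectly independent.

```python
from itertools import groupby

def run_lengths(flips):
    """Lengths of consecutive runs, e.g. 'HHTH' -> [2, 1, 1]."""
    return [len(list(group)) for _, group in groupby(flips)]

def run_chi_square(flips, max_len=5):
    """Chi-square statistic for run lengths against P(length = n) = 0.5 ** n.

    Runs of max_len or longer are pooled into one bucket, whose
    probability is 0.5 ** (max_len - 1).
    """
    runs = run_lengths(flips)
    total = len(runs)
    if total == 0:
        raise ValueError("need at least one flip")
    stat = 0.0
    for n in range(1, max_len + 1):
        if n < max_len:
            observed = sum(1 for r in runs if r == n)
            p = 0.5 ** n
        else:
            observed = sum(1 for r in runs if r >= max_len)
            p = 0.5 ** (max_len - 1)
        expected = total * p
        stat += (observed - expected) ** 2 / expected
    return stat

print(run_lengths("HHTHTTHHTHTHHHT"))  # [2, 1, 1, 2, 2, 1, 1, 1, 3, 1]
```

With five buckets there are four degrees of freedom, so a statistic above about 9.488 would be flagged at the 5% level. Someone who fakes their flips by alternating too neatly ends up with far too many runs of length one and scores well above that.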

  1. I gave a talk at the Big Data Insight Group in London recently and they've just posted my talk online.


    I talk about how we've helped EMI Music make use of data and about how we're doing so in zeebox.

    One of the themes throughout my talk is the importance of people. Both in terms of how we use data to help people make decisions and about how we need to understand the people we're trying to help, in order to give them what they need. Technology enables this, but without the right people and without understanding people, technology is as good as useless.


    I also talk about how important skills and judgement are. And that, although data is sometimes seen as the thing that drives decisions, it's usually or perhaps always used alongside skills and judgement.


    I think that admitting to the role of skills and judgement isn't being 'anti-data'. I think that being honest about this enables and empowers us to better use data in the right ways. And it certainly helps people to feel comfortable with data, also!

    With the right people in place and data playing the right role in an organisation, the opportunity for data to help an organisation is massive. The way that EMI Music has embraced data across the organisation alongside skills and judgement shows that this is the case.



  2. We all know there are decisions where you need data to help you make them and there are decisions where data just isn't that important. This morning XKCD did a wonderful job of illustrating it. http://xkcd.com/1036/

    Buying a lamp is a creative decision. Turn your eye away from the reviews and go with your heart :)

    The same is true of many decisions data folks are asked to help with every day in organisations. We shouldn't be afraid to champion this strategy there, either!


  3. We sat down recently to talk data and insight. Here is what we talked about, plus a little video of me talking about insight at both zeebox and EMI.

    http://www.thebigdatainsightgroup.com/site/article/david-boyle-emi-zeebox-data-driven-includes-video

  4. I don't like the term 'scientist' as it makes the role sound inaccessible and elite. Google's Hal Varian said "the sexy job in the next ten years will be statisticians" ... but I don't like that term either. I'd replace 'statisticians' with 'working with data' or something ... and then I believe it! I think data people have a tendency to overplay the 'statistics' and the magic of it, and to underplay the importance of the 'bringing it to life' and 'helping people understand / make use of it' parts of working with data.

    I thought about this because of this cool article in The Guardian about data scientists.

    As it points out, "science" is defined as "systematic study of natural or physical phenomena". I guess that's us all. Perhaps I shouldn't shy away from that phrase.

    The journalist describes the role well, as "someone who can bridge the raw data and the analysis - and make it accessible. It's a democratising role; by bringing the data to the people, you make the world just a little bit better." Perfect, eh?

    One last quote: "the four qualities of a great data scientist are creativity, tenacity, curiosity, and deep technical skills." That list sounds pretty good to me, also. So perhaps I should rename this the 'data scientist' blog and be done :)




  5. Some fun from http://fosslien.com/ via http://www.freakonomics.com/2012/02/29/the-life-of-the-number-crunching-analyst/


    I particularly like this one:


  6. So much data, so easily displayed in such a small but easy to understand format. I need say no more. I'm in love with the new sparklines just made available in Google Spreadsheets: http://support.google.com/docs/bin/answer.py?hl=en&answer=2371371


    It's this simple:

    Google Spreadsheets is rapidly becoming my go-to choice for building business dashboards. Bye, bye cost. Bye, bye developers (would be VERY sad not to work with them, of course). Bye, bye Microsoft!



  7. I spoke on a panel last night on the subject 'data as the new black gold'. There are three challenges I think this metaphor poses to the data world.







    First, that of crude oil. Data is everywhere in organisations, but too often left in its crude form: gloopy and unusable. The oil industry had to work this out before it could be mainstream. It had to refine oil to a form that works for consumers day-to-day and it had to make it available to consumers in ways that fitted into their daily life. It's trivial to stop by a petrol station and pick up some oil in a format you can instantly make use of. Data doesn't yet work the same way: it's rare to find an organisation that appropriately refines it and then makes it available to its people in a way they can access and make use of as part of their day-to-day work.


    Second, I think we need to demand higher 'miles per gallon' from our data. Often we gather fantastic raw data, capable of being a really powerful part of decision making ... but then business leaders don't ask interesting questions of it. They don't demand smart analysis and challenge the data to offer insight. It's like demanding that cars offer higher miles per gallon from the oil they are burning.


    Finally, I think we need to embrace hybrid technology. In cars that's about oil being only part of the story for how the car gets powered. In data it's about saying that data is only part of the story for how organisations get powered. We need to be honest and bold about the role of skills & judgement alongside data in powering organisations. Too many people believe / pretend that data alone can power organisations to greatness. Everything I've seen tells me that data is necessary but not sufficient: smart people using the data alongside their expertise are ALWAYS required. The data world should be honest about this and build data and systems around that truth. I've always found that has a much greater impact :)

  8. I've used a lot of word clouds recently. But I think of them as charts really, since they are still pretty faithful to the underlying data. The size of the word is proportional to the number of times that word is in the data set. Simple.

    But reading a cool data visualization book I came across this. It's not really based on 'data', but it's interesting that the words and their location on the page convey so much information. Perhaps some good, well placed words can replace the need to chart actual data?

    http://creativeroots.org/2011/03/italy-infographic-map/

  9. Simple, easy to read, but really powerful. Nice little sparklines spotted in the papers from the 20 week scan my wife just had. Cool little charts like this should be everywhere!

    And by the way, it's a boy!
