Extra Credit: Benford's Law

Background

In the natural world we find many hidden, mathematical patterns. You can find the Fibonacci sequence in conch shells, fractals in plants, and the Golden Ratio in great works of architecture such as the Great Mosque of Kairouan. In this assignment, you will explore another pattern that shows up in some (large) sets of numbers: Benford's Law.

When looking at large sets of data, you would assume that the first digits of any large set of numbers would be roughly evenly distributed between 1 and 9. In fact, it is likely that about 30% of the numbers start with a 1 (vs. 5% of the numbers will start with a 9). Benford's Law includes an equation to calculate expected frequencies of the leading digits of data, and you can read more about that equation and its proof if you are interested.

Benford's Law could be just an interesting pattern that shows up in unexpected places (I found it when I was researching the popularity of World of Warcraft mods ....), but it does have a valuable use in detecting fraud in various types of data, particularly financial data, since people tend to try to construct fake data using numbers with an equal distribution of digits. It does have other uses as well, as you can read about in the article linked above.

Problem Description

For now, we will simply use this law as a way to practice a few programming skills: reading a data file, using modulo arithmetic to process numbers, looping, and managing arrays.

Your program will need to:

  1. open a file for reading
  2. process numbers to identify the leading (i.e. leftmost) digit for each number in the file. There are a variety of ways to do this.
  3. record the number of times each digit between 1 and 9 was seen as the leading digit of a number in the file
  4. report the results as a table of frequencies or a histogram or both.

In addition, you should analyze the results of the program and evaluate whether or not your testing data fits with the expected ratios given by Benford's Law. If you are able to do the calculations for the log function, go ahead. Otherwise, you can draw a histogram and see if the distribution looks approximately like the one given in the article referenced above.

Benford's Law does fit many collections of naturally occuring data but not all. If your data does not fit, don't worry. Just report your results .... this is what often happens in experimentation, even in computer science.

Input Format

Your program should be able to read in a file of numbers that are listed with one number per line. You do not know how many numbers there will be in the file. This is a short example file for initial testing: benfordexample.txt. This is a list of enrollments in Research Park Triangle colleges in the Fall of 2000.

You will probably want to be able to specify the file name on the command line. Alternatively, you could make the file name a string that is requested by the program when it starts running.

Output Format

At a minimum, your program should print a table of digits and the frequency, such as:

Digit Frequency
1 3
2 4
3 1
4 0
5 2
6 1
7 0
8 0
9 0

You can also gain extra points by figuring out how to print a histogram (which can be oriented sideways to make it easier to print in C).

Testing

You should test your program on multiple data sets. You can use the benfordexample.txt to start, but then you should test with larger data files. You can find one on your own (give acknowledgement where you got it!), or you can use one or more of these:

Note that you will need to "clean up" these data files before proceeding. The first two files just need the first few lines removed. The third one will need substantial preprocessing to remove or ignore the string between the parentheses.

You should include:

Grading

This project will be worth up to 15 extra credit points, according to the following rubric.

Acknowledgements

This project is an adaptation of the Natural Prestidigitation assignment from Stanford's Nifty Assignments collection, including sample data file links.