Tuesday, 14 February 2012

Law Of First Digits & How It Might Lead To More Trust

In 1938 a physicist called Frank Benford wrote a paper about something that he had noticed concerning collections of numbers. These data sets were drawn from real-life situations and were surprisingly diverse, ranging from bills, to populations, to death rates. Being a physicist, Benford captured the law mathematically:

P(d)=\log_{10}(d+1)-\log_{10}(d)=\log_{10} \left(1+\frac{1}{d}\right).
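But it is best understood by considering numbers written in base 10, i.e. with leading digits 1 through to 9, and drawing out the probability of finding each digit as the first digit of any number in one of the real-life data sets that Benford considered. The result, from the formula above, is approximately: 1: 30.1%, 2: 17.6%, 3: 12.5%, 4: 9.7%, 5: 7.9%, 6: 6.7%, 7: 5.8%, 8: 5.1%, 9: 4.6%.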

Hence, if you study one of the relevant data sets you find that 1 is the first digit approximately 30% of the time and 9 about 5% of the time. The phenomenon had been noticed some years earlier, in 1881, by a Canadian-born mathematician and astronomer called Simon Newcomb, but Benford was the one who did a lot of work to show that the law held good across a wide variety of data types, and so the law bears his name.
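
The percentages quoted above follow directly from the formula. As a minimal sketch in Python (the function and variable names are my own, chosen purely for illustration), the first-digit probabilities can be computed like this:

```python
import math


def first_digit_probability(d):
    """Benford probability that d (1..9) is the leading digit."""
    return math.log10(1 + 1 / d)


# Map each possible leading digit to its expected probability.
benford = {d: first_digit_probability(d) for d in range(1, 10)}

for digit, probability in benford.items():
    print(f"{digit}: {probability:.1%}")
```

The nine probabilities sum to 1, as they must, because the logarithms telescope to log10(10).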

Whilst Benford's law was empirically derived, it can be mathematically proved that it applies exactly to a whole range of naturally occurring number sequences such as the Fibonacci numbers, the factorials, the powers of 2, and the powers of almost any other number for that matter. Plus it can be shown to apply to numbers in other bases such as base 16. As late as 1995 a mathematician called Theodore Hill published a paper that showed that the law could be generalised to apply not just to the first but to any digit within a number. Hence, the probability that d (d = 0, 1, ..., 9) is encountered as the n-th (n > 1) digit is:

P_n(d)=\sum_{k=10^{n-2}}^{10^{n-1}-1}\log_{10} \left(1+\frac{1}{10k+d}\right).

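The generalised law is usually written as a sum of log10(1 + 1/(10k + d)) over every possible block of leading digits k that can precede d. The sketch below, in Python, evaluates that sum and compares the second-digit prediction with the second digits actually seen in powers of 2. The function name, the variable names and the choice of 3000 powers are my own assumptions for illustration.

```python
import math
from collections import Counter


def nth_digit_probability(d, n):
    """Benford probability that d appears as the n-th digit (n = 1 is the first)."""
    if n == 1:
        # A leading digit can never be zero.
        return 0.0 if d == 0 else math.log10(1 + 1 / d)
    # Sum over every (n-1)-digit prefix k that could precede d.
    lo, hi = 10 ** (n - 2), 10 ** (n - 1)
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(lo, hi))


expected_second = {d: nth_digit_probability(d, 2) for d in range(10)}

# Powers of 2 from 2**4 = 16 upwards, so each has at least two digits.
powers = [2 ** k for k in range(4, 3004)]
counts = Counter(int(str(p)[1]) for p in powers)
observed_second = {d: counts[d] / len(powers) for d in range(10)}

# Largest gap between observed and predicted second-digit frequencies.
max_gap = max(abs(observed_second[d] - expected_second[d]) for d in range(10))
print(f"largest gap: {max_gap:.4f}")
```

Note how much flatter the second-digit distribution is than the first: 0 is expected about 12% of the time and 9 about 8.5% of the time.
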
Very interesting if you are a student of mathematics and its use in modelling the world in which we live, but one might think that is as far as its usefulness goes. Not so.

As far back as 1972 an economist called Hal Varian (now working for Google) suggested that one could use Benford's law to tell whether socio-economic data had been manufactured or was derived from real-life situations. And suddenly the light bulb went on. Why could Benford's law not be applied to detect fraud in a range of data sets, such as tax returns or election results? Extensive tests showed that Benford's law could, within certain limitations, give a reliable indication of fraud within data sets. To this day evidence based upon Benford's law is admissible in most US courts.
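
To give a flavour of how such a check might work, here is a minimal sketch in Python that scores a data set with a chi-square statistic against the Benford first-digit frequencies. This is one common way of making the comparison, not necessarily the one used in the tests mentioned above. All names are my own, the Fibonacci numbers merely stand in for "real" data, and the "invented" figures are uniformly random three-digit numbers from a seeded generator.

```python
import math
import random
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_8DF_5PCT = 15.507


def first_digit(n):
    """Leading decimal digit of a non-zero integer."""
    return int(str(abs(n))[0])


def benford_chi_square(values):
    """Chi-square distance between observed first digits and Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic


def fibonacci(count):
    """First `count` Fibonacci numbers, used here as Benford-conforming data."""
    a, b = 1, 1
    out = []
    for _ in range(count):
        out.append(a)
        a, b = b, a + b
    return out


natural = fibonacci(1000)
rng = random.Random(0)  # fixed seed so the run is repeatable
invented = [rng.randint(100, 999) for _ in range(1000)]

natural_stat = benford_chi_square(natural)
invented_stat = benford_chi_square(invented)
print(f"natural:  {natural_stat:.2f}")
print(f"invented: {invented_stat:.2f}")
```

A statistic above the critical value suggests the first digits do not follow Benford's law. That is only a prompt for closer scrutiny, since plenty of honest data sets (heights, telephone numbers, anything confined to a narrow range) do not follow the law either.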

All of which brings us to the present day, when we are presented with ever-increasing volumes of data that enter our lives electronically. The Internet now holds over a zettabyte of data and we are constantly having to make judgements about whether to trust that data. The question might be as simple as whether an image has been altered, or as weighty as whether a large statistical data set should be used to make a critical business decision. Which makes me ask if there is not some way to apply Benford's law, and its generalised forms, to help us decide whether or not we can trust some electronic data we may be about to rely upon.

Trust is such a fundamental aspect of how we use the Internet that at the very least this is an area that is worthy of more research.

Monday, 6 February 2012

Security Flaw in eBanking Affects Over 100 Million Users

I sometimes find myself slightly irritated by having to prove that I am a human online, most often by solving a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). You've all had to use them at some point: those funny, distorted versions of a piece of text that only a human can decipher.



I content myself with the fact that these do provide a degree of security against web bots, and that maybe my irritation is more to do with my eyesight than anything else.  And, so, I have lived with this sense of security for some time. 

I then noticed that CAPTCHAs were being used by some financial institutions, in some cases as part of a transaction verification process, which I assumed was again for protection against automated attack. Although not common in the UK, various banks in countries such as the US, Germany, China and Switzerland make extensive use of CAPTCHAs. And this is not just a few banks: these CAPTCHAs are potentially used by over a hundred million customers.

The use in eBanking made me wonder just how vulnerable they were to attack: is it only a human that can decipher them? Would a computer not be just as good as a myopic Professor? I was shocked when my friend Dr Shujun Li told me that not only were CAPTCHAs vulnerable but that he had demonstrated how you could successfully attack nearly 100% of those he had found in eBanking.

So, how does the attack work? By combining a series of image and text processing techniques that have been known for a long time. Specifically:

  1. Segment objects from a CAPTCHA image
  2. Image processing for removing noise and decoy objects and refining the shapes of segmented objects
  3. Detect random grid lines used in some e-banking CAPTCHA schemes
  4. Image inpainting for removing unwanted objects from a CAPTCHA image
  5. Character segmentation for extracting characters from a CAPTCHA image
  6. Character recognition for recognizing characters in a CAPTCHA image
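

To make the flavour of these steps concrete, here is a toy sketch in Python of three of them: separating ink from background, removing small noise blobs, and splitting what is left into characters at empty columns. It is emphatically not Dr Li's implementation. The function names, the thresholds and the tiny hand-drawn "image" are all my own assumptions, and grid-line detection, inpainting and character recognition are left out.

```python
from collections import deque


def binarize(gray, threshold=128):
    """Mark dark pixels (below threshold) as foreground 1, the rest as 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_small_components(binary, min_size=3):
    """Drop 4-connected foreground blobs with fewer than min_size pixels."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not binary[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill to collect one blob.
            queue, blob = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(blob) >= min_size:
                for by, bx in blob:
                    out[by][bx] = 1
    return out


def segment_columns(binary):
    """Split at empty columns; return (start, end) spans, end exclusive."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


# Hypothetical 5x10 image: two "characters" plus one stray noise pixel.
rows = [
    "..........",
    ".##...#.#.",
    ".##.....#.",
    ".##.....#.",
    "..........",
]
gray = [[0 if c == "#" else 255 for c in row] for row in rows]

binary = binarize(gray)
cleaned = remove_small_components(binary, min_size=3)
raw_spans = segment_columns(binary)
clean_spans = segment_columns(cleaned)
print(raw_spans, clean_spans)
```

Without the noise-removal step the stray pixel is mistaken for a third character. With it, only the two genuine characters remain, ready to be handed to a recogniser.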


Hence, having recovered the text from the CAPTCHA presented as, say, part of a transaction verification process, a piece of malware could simply transfer that text to the required section of the bank's web page and thus appear to be a valid human. Obviously, breaking CAPTCHAs in this way is only useful in particular attack scenarios, but these were tried and they succeeded. A piece of malware would have to conduct a man-in-the-middle attack; such attacks are still quite rare but are gaining a foothold. For a full description of the attack scenarios go read the paper. The attack was tested against a large range of eBanking systems and the results show quite conclusively that effectively 100% could be compromised.

When this attack was run using a standard laptop, one attack scenario took only 150ms to complete successfully. Hence, a computer can not only do what a myopic Professor can, it can do it a lot faster. Imagine this in an automated mass-attack. Imagine a trojan that was on machines in, for example, China, where it stole a few pennies from millions of customers. Imagine what the bank's response might be. Might the bank not be tempted to say that it must have been a human that was involved and hence you must have revealed your password? After all, it can't have been a machine because of the CAPTCHA process.

But the thing that disturbs me a lot more is that Dr Li did this research some considerable time ago, and it was published, and he told those affected, yet those very same financial institutions continue to use this method. I don't intend to say who those institutions are here, but if you are such an institution please take note of this research and move to something like hardware-based or two-factor authentication. Nothing is perfect, but it is clear that CAPTCHAs are not the answer.