Megan Yates

A Tale of 100 Jelly Beans


If you had a jar of 100 jelly beans and knew one of them was laced with poison, would you offer the jar to your kid? We sidestepped the ethics of that question and thought instead about how we depict rare events. For any data practitioner, rare events can be tricky to model. However, the problem has more to do with the number of cases in a sample than with rarity alone. From a statistical perspective, a warehouse containing 1,000 poisoned jelly beans out of 100,000 is much easier to model than a jar containing 1 poisoned jelly bean out of 100, even though the rate is 1% in both. This is called small-sample bias.
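To make the jar-versus-warehouse comparison concrete, here is a minimal sketch. It assumes, for illustration only, that each collection is a simple random sample from a population with a true 1% rate and that the binomial standard error applies; the function names are ours and not part of any library.

```python
import math


def relative_standard_error(rate: float, n: int) -> float:
    """Binomial standard error of an estimated rate, as a fraction of the true rate."""
    return math.sqrt(rate * (1 - rate) / n) / rate


def prob_no_events(rate: float, n: int) -> float:
    """Chance that a random sample of size n contains no rare events at all."""
    return (1 - rate) ** n


RATE = 0.01  # assumed true rate: 1 poisoned bean per 100

for label, n in [("jar", 100), ("warehouse", 100_000)]:
    rse = relative_standard_error(RATE, n)
    p_zero = prob_no_events(RATE, n)
    print(f"{label}: n={n}, relative standard error={rse:.3f}, P(no events)={p_zero:.3f}")
```

Under these assumptions, the uncertainty in the jar-sized estimate is about as large as the rate itself, and roughly a third of jar-sized samples would contain no poisoned bean at all. The warehouse-sized sample pins the same 1% rate down to within a few percent of its value.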

For a fixed sample, there are a number of statistical methods for dealing with this type of small-sample bias. But there are commercial cases where, rather than managing the problem statistically and living with poor modelling outcomes, combining data would be the more effective solution.

Combining data refers to pooling data so that the data sets become larger, thereby avoiding the bias inherent in small samples.


Fraudsters are good at math: A perfect example of this is fraud in the insurance industry. Criminals have come up with a multitude of creative ways to defraud insurance companies. These range from claims for injuries that never occurred, inflated claims, non-matching income and false theft claims to more extreme false death claims. In New York, Michael Danilovich defrauded auto-insurers of $279 million in false whiplash claims. In an elaborate scheme, Danilovich packed inexpensive vehicles with paid passengers, set up low-speed accidents and had the passengers claim treatment for whiplash and back and neck injuries. He also bribed a host of doctors to confirm the soft-tissue injuries, which are notoriously hard to prove. All tests and treatments were billed to the insurers. To complete the con, Danilovich’s attorneys threatened to sue insurers unless they paid up. The scale of this network of deception was quite remarkable, with doctors seeing up to 150 ‘patients’ each day, and over 100 medical clinics involved in the scam. Danilovich might have been caught earlier if the forensic investigators had had larger data samples to work with.

Insurance fraud costs the South African insurance industry more than R4 billion every year. Sophisticated fraudsters typically commit fraud with more than one insurer. Data combination, where insurers pool data, would hugely improve detection of these rare fraudulent events. In our new data economy, companies have become highly cognizant of the value of data, but unfortunately, data pooling would likely only happen if driven by regulation.
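As a rough illustration of why pooling helps, the sketch below flags claimants who appear at more than one insurer. The records, the insurer and claimant names, and the `claimants_across_insurers` helper are all made up for this example; a real scheme would need proper identity matching and privacy safeguards, and appearing at several insurers is only a prompt for review, not proof of fraud.

```python
from collections import defaultdict

# Made-up claim records, purely illustrative: (insurer, claimant_id)
claims = [
    ("insurer_a", "claimant_1"),
    ("insurer_a", "claimant_2"),
    ("insurer_b", "claimant_2"),
    ("insurer_b", "claimant_3"),
    ("insurer_c", "claimant_2"),
    ("insurer_c", "claimant_4"),
]


def claimants_across_insurers(records, min_insurers=2):
    """Return claimants who have claims at min_insurers or more distinct insurers."""
    insurers_by_claimant = defaultdict(set)
    for insurer, claimant in records:
        insurers_by_claimant[claimant].add(insurer)
    return {
        claimant: sorted(insurers)
        for claimant, insurers in insurers_by_claimant.items()
        if len(insurers) >= min_insurers
    }


# Pooled view: one claimant shows up at three insurers.
flagged = claimants_across_insurers(claims)
print(flagged)

# Single-insurer view: the same pattern is invisible.
only_a = [record for record in claims if record[0] == "insurer_a"]
print(claimants_across_insurers(only_a))
```

In this toy data no single insurer can see the pattern on its own; it only emerges once the records are combined.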

Small samples can have big consequences: Unfortunately for consumers, fraudulent claims result in increased premiums being passed on to policyholders. Although there is no exact figure, it is estimated that insurance premiums can be up to 30% higher because of fraud. Income and affordability are two of the biggest barriers to growing the insurance market and reaching more consumers. It’s hard to quantify how many consumers are excluded due to high premiums, but we can be certain that high insurance premiums don’t help.

If insurers are covering much of the cost of fraud through inflated premiums and have internal investigative fraud teams to minimize fraud losses, how strong is the business impetus for them to deal with fraud effectively through data combination? Are regulators data savvy enough to drive change by promoting regulatory data combination in the industry, or is it up to us, as data practitioners, to drive this change?


©2019 Ixio