The reCAPTCHA Project

Back in 2000, Carnegie Mellon University computer science graduate student Luis von Ahn, along with his advisor Manuel Blum, created a new cyber security tool called CAPTCHA. A CAPTCHA (short for Completely Automated Public Turing Test To Tell Computers and Humans Apart) is a program that protects web sites from automated bot attacks and spammers by generating a test that humans can pass but computers cannot.

The original iteration of CAPTCHA generated an image composed of several randomly selected and distorted characters. To gain access to a protected site, users had to prove that they were human and not a computer by correctly deciphering and retyping the characters. Because computers cannot process distorted images and text as well as humans can, CAPTCHAs immediately proved effective at thwarting most automated attacks. The technology quickly caught on and today is used to secure tens of thousands of web sites. Eventually, though, persistent hackers and spammers found ways to outsmart the technology invented to outsmart them by creating programs capable of reading and cracking CAPTCHAs. Although initially quite effective as a cyber security tool, CAPTCHAs became steadily less effective over time.

Now, von Ahn has taken CAPTCHA technology a step further with the CyLab-funded reCAPTCHA Project. Many CAPTCHAs in use today provide web users with one image of text containing only minor distortions. Some CAPTCHAs merely provide users with plain, undistorted text, while others ask users to type in answers to simple arithmetic questions such as “What is 1+1?” CAPTCHAs like these are susceptible to even basic, unsophisticated attacks and are easily cracked by automated computer programs. In contrast, the reCAPTCHA Project builds on the original concept of using distorted textual images, adding further layers of security to make CAPTCHAs as effective as possible.
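To see why a plain-text arithmetic question offers so little protection, consider the minimal sketch below. It is an illustration only, not a tool taken from any real attack: the function name solve_arithmetic_captcha is hypothetical, and the sketch assumes the question arrives as ordinary text containing two whole numbers and one operator.

```python
import re


def solve_arithmetic_captcha(question: str) -> int:
    """Answer a plain-text question such as 'What is 1+1?'.

    Hypothetical helper: assumes two non-negative integers joined by +, - or *.
    """
    match = re.search(r"(\d+)\s*([+\-*])\s*(\d+)", question)
    if match is None:
        raise ValueError("no arithmetic expression found")
    left = int(match.group(1))
    operator = match.group(2)
    right = int(match.group(3))
    if operator == "+":
        return left + right
    if operator == "-":
        return left - right
    return left * right


print(solve_arithmetic_captcha("What is 1+1?"))  # prints 2
```

A few lines of pattern matching are enough to pass such a test every time, which is why distorted images of text are the harder problem for a program.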


Security builds a digital library

The reCAPTCHA Project lies at the intersection of cyber security, technology, and the ongoing effort by several commercial and non-profit organizations to digitize printed information.  In the process of scanning and digitizing books and other print materials, approximately 20% of words are found to be inherently distorted and unreadable by Optical Character Recognition (OCR) programs.  reCAPTCHA takes advantage of this glitch in the digitization process. Each reCAPTCHA image is constructed using one of these inherently distorted words as a starting point.  The image is then further distorted by adding random warping and lines.  Finally, reCAPTCHA pairs this image with another distorted word image also taken from printed material.  The end result is a CAPTCHA that, while still readable and decipherable by humans, is more complex and more difficult for computers to read.
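The pairing of two words can be sketched in code. This article does not say how a typed answer is checked, so the sketch below rests on an assumption: one word of the pair is treated as a control whose answer is already known, and an answer for the other word is accepted only after enough users who passed the control agree on it. The function names and the three-vote threshold are hypothetical and chosen for illustration.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def check_challenge(control_answer: str, typed_control: str,
                    typed_unknown: str, votes: Counter) -> bool:
    """Return True if the user passes the test.

    Assumption: the user is judged only on the control word. If that is
    correct, the answer typed for the unknown word is recorded as one vote.
    """
    if normalize(typed_control) != normalize(control_answer):
        return False
    votes[normalize(typed_unknown)] += 1
    return True


def accepted_transcription(votes: Counter, min_votes: int = 3):
    """Return the leading answer once it has min_votes votes, else None.

    min_votes is a hypothetical threshold, not a figure from the article.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


votes = Counter()
for typed in ("overlooks", "Overlooks", "overlooks"):
    check_challenge("morning", "morning", typed, votes)
print(accepted_transcription(votes))  # prints overlooks
```

Under this assumed design a bot that cannot read the control word never gets to vote, so the transcriptions collected for the unknown word come only from users who have already shown they can read distorted text.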

Critics of CAPTCHAs have long decried the cyber security method as flawed, claiming that the same technological advances that allow researchers such as von Ahn to create new ways of securing web sites also give hackers and spammers opportunities to develop new programs for cracking CAPTCHAs. Other critics dislike CAPTCHAs on the grounds that they are impossible to read, require too much work on the part of the user, and may actually deter some visitors, reducing web site traffic and blog comments. In reality, however, CAPTCHAs and their latest incarnations from reCAPTCHA continue to prove effective in the ongoing battle against automated bot and spam attacks. When Time's 2009 online poll for the world's most influential person was hacked and flooded with tens of millions of spurious votes, Time added reCAPTCHA to the poll's web site. Automated voting stopped immediately, and the hackers had to resort to typing in CAPTCHAs by hand to keep manipulating the results. Once reCAPTCHA was in place, they were able to submit only an additional 200,000 votes, strong evidence that reCAPTCHA works.

Beyond its obvious use for foiling bot attacks and would-be spammers, the reCAPTCHA Project has another, more altruistic purpose. Several years after introducing the world to CAPTCHA technology, von Ahn realized that, although each CAPTCHA takes just a few seconds to type, humans were collectively spending hundreds of thousands of hours each day typing in more than 100 million CAPTCHAs. reCAPTCHA technology was developed not merely with an eye toward improving cyber security, but also as a way to harness and reuse the collective human time and mental energy spent solving and typing CAPTCHAs, a concept von Ahn has dubbed “human computation.” Because reCAPTCHAs are built from words tagged as unreadable during the digitizing of books and other printed material, millions of web users play a part every day in the digitization and preservation of human knowledge simply by transcribing words. Tests have shown that reCAPTCHA textual images are deciphered and transcribed with 99.1% accuracy, a rate comparable to the best human professional transcription services. In just the first year after launching reCAPTCHA, humans correctly deciphered and transcribed more than 440 million words, roughly the equivalent of 17,600 books.
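The figures in this paragraph can be checked with a little arithmetic. The sketch below uses the article's numbers for daily CAPTCHAs, words transcribed, and books. The ten seconds per CAPTCHA is an assumption made for illustration, since the article says only "a few seconds."

```python
# Figures taken from the article.
CAPTCHAS_PER_DAY = 100_000_000
WORDS_TRANSCRIBED = 440_000_000
EQUIVALENT_BOOKS = 17_600

# Assumption for illustration: the article says only "a few seconds".
SECONDS_PER_CAPTCHA = 10

# Total human time spent per day, converted from seconds to hours.
hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / 3600

# Average book length implied by the article's two figures.
words_per_book = WORDS_TRANSCRIBED / EQUIVALENT_BOOKS

print(round(hours_per_day))  # prints 277778
print(words_per_book)        # prints 25000.0
```

At an assumed ten seconds each, 100 million CAPTCHAs come to roughly 278,000 hours a day, consistent with "hundreds of thousands of hours," and the book comparison implies an average of 25,000 words per book.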

In September 2009, the news broke that Google had acquired reCAPTCHA.

Here is a brief excerpt from a Wall Street Journal story about the announcement:

Google says reCAPTCHA's technology can help it with some of its high-profile initiatives, like scanning books and newspapers to create searchable archives. As users type in the words, they help teach computers to read scanned text, improving computer accuracy when converting scanned images into plain text, a process known as optical character recognition. "Having the text version of documents is important because plain text can be searched, easily rendered on mobile devices and displayed to visually impaired users," Google said in a blog post about the deal. (Wall Street Journal, 9-16-09)

Luis von Ahn continues to break ground in the fields of artificial intelligence and human computation, as evidenced by his latest success story, Games With A Purpose (GWAP), which uses online gaming to build more intelligent computers. The success of the reCAPTCHA Project, funded and supported by CyLab, is just a glimpse of the pioneering work yet to come from innovators such as Luis von Ahn at Carnegie Mellon.