DAVID SIMS: Archiving the whole Web -- can you give us some idea of the scale of that?
BREWSTER KAHLE: There are about 640,000 Websites right now. It's still doubling about every six months. If you count the total number of pages, that's about 2 terabytes. A terabyte is a million megabytes, so that's 2 million megabytes of material.
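
The arithmetic in that answer can be checked with a short sketch. The figures (640,000 sites, doubling every six months, 2 terabytes) are the ones quoted above; the decimal unit convention, the assumption that doubling continues at a steady rate, and the function names are illustrative assumptions, not anything stated in the interview.

```python
# Figures quoted in the interview; decimal units are an assumption.
MEGABYTES_PER_TERABYTE = 1_000_000
SITES_NOW = 640_000
DOUBLING_PERIOD_MONTHS = 6


def terabytes_to_megabytes(terabytes):
    """Convert terabytes to megabytes using 1 TB = 1,000,000 MB."""
    return terabytes * MEGABYTES_PER_TERABYTE


def projected_sites(months_ahead):
    """Site count after months_ahead, assuming steady doubling every six months."""
    return SITES_NOW * 2 ** (months_ahead / DOUBLING_PERIOD_MONTHS)


print(terabytes_to_megabytes(2))   # size of the archive in megabytes
print(round(projected_sites(12)))  # two doublings after one year
```

Under these assumptions, one year of growth is two doublings, so the site count would quadruple.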

SIMS: How big is a terabyte? All the text in the Library of Congress is about 20 terabytes.
KAHLE: This tape robot, about the size of two Coke machines, has got this arm that goes and pulls tapes from these shelves and sticks them into a bank of tape drives, and then it copies them onto hard disks.

A paper on the technical aspects of archiving the Internet is online at Internet Archive's site.

Alexa is instantiated by a tool bar or a little dashboard on the bottom of your screen that talks to your browser and then talks back to databases at Alexa that then display information about where you are and where you might want to go. So it's meta-data, trying to give you an idea who's behind the Website, how often has it changed, is it very popular, are there any security warnings -- those sorts of things.

SIMS: And where else might you want to go from here? And that information is difficult to compute. That's the real crux of the Alexa service.
KAHLE: The Holy Grail in all of this is usage paths. Where have other people who have been on this Website gone? Where did they go that they had a good time? Where's the good stuff? After somebody has sorted through the search engines and directories. Other people have found the good stuff; why can't I leverage that?
SIMS: You're tracking people's paths like a scout tracking paths through the forest?
KAHLE: Yes, but we don't care who they were. We just want to know where are the high-traffic sites that might lead to the good views or the water hole or the good things in the forest.

We're starting the system with other suggestions that come from link analysis, content analysis, and some editorial judgment.
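
The usage-path idea described above, counting where visitors go next without recording who they were, might be sketched as follows. This is a minimal illustration, not a description of how Alexa works: the session data, site names, and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def build_next_site_counts(sessions):
    """Count site-to-site transitions; sessions carry no user identity."""
    counts = defaultdict(Counter)
    for path in sessions:
        # Pair each site with the one visited immediately after it.
        for here, there in zip(path, path[1:]):
            counts[here][there] += 1
    return counts


def suggest(counts, site, limit=3):
    """Return the most common next destinations from the given site."""
    return [dest for dest, _ in counts.get(site, Counter()).most_common(limit)]


# Hypothetical anonymous browsing paths.
sessions = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "b.example"],
    ["a.example", "c.example"],
]
counts = build_next_site_counts(sessions)
print(suggest(counts, "a.example"))
```

Only the ordered lists of sites are kept, which matches the point that the paths matter and the identities do not.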

SIMS: There's something in it called the 404 Killer?
KAHLE: Well, one thing that you get for free if you've got an archive is you've got an ability to make out-of-print Web pages come back. There are some valuable resources that are just now out of print. You don't expect all books to be continuously printed. In the same way, we shouldn't expect all great Web pages to always be on some site that will always be there. The Web changes -- something like 1 percent each week.
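
One way to picture bringing an out-of-print page back is a lookup that falls back to an archived copy when the live page is gone. The sketch below uses plain dictionaries as stand-ins for the live Web and the archive; the names and data are hypothetical, and it does not claim to reflect the actual 404 Killer.

```python
def fetch(url, live_web, archive):
    """Return (source, page), preferring the live copy over the archived one."""
    page = live_web.get(url)
    if page is not None:
        return "live", page
    # The live page is gone: fall back to the archived copy, if any.
    archived = archive.get(url)
    if archived is not None:
        return "archive", archived
    return "missing", None


# Hypothetical stand-ins for the live Web and the archive.
live_web = {"http://a.example/": "current page"}
archive = {
    "http://a.example/": "older copy",
    "http://gone.example/": "out-of-print page",
}

print(fetch("http://a.example/", live_web, archive))
print(fetch("http://gone.example/", live_web, archive))
```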

SIMS: How did your path lead here?
KAHLE: At MIT, we always wanted to make things that made an impact on large numbers of people.

Danny Hillis was one of the founders of Thinking Machines, where Brewster Kahle designed supercomputers in the 1980s. Hillis is now a Disney Fellow.

That was the era of VAXes, and they weren't very fast computers. Danny Hillis had this idea of constructing a Connection Machine with tens of thousands of processors all working together. It sounded like a good idea, so we built this thing that allowed us to process and search through gigabytes in a fraction of a second.

From there, this Internet stuff started coming around, and we said, "Well, that's not that different; it's just another network of computers. How do we make that searchable?" And that was the genesis of the WAIS project.

We have these computers that are very fast, and then the Internet came along and added the content. Now we've got something to play with. Now we have critical mass to go and try to make something so that these computers can augment and provide advice that's maybe useful to people.