Over the past few years, a lot has been happening in AI in both hardware and software. We have seen new algorithms, new processing techniques, and new AI chips. Jim Keller, CTO and President of Tenstorrent, an AI startup, sheds light on these cutting-edge technologies in an interview with EFY.
Q. What motivated you to get involved in a startup after multiple roles in corporate companies? How has your experience at Tenstorrent been?
A. When I was at Tesla, a whole bunch of AI startups were coming and trying to pitch Tesla on their AI stuff. Then I went to Intel, which was one of the challenges of a lifetime—a team of 10,000 people! When I left Intel, I thought about starting a new company from scratch, but the AI revolution had already started. So, I joined the company (Tenstorrent).
I was also their first investor. Ljubisa Bajic started the company and he called me and said, “Hey, we have this new idea to do an AI processor that’s different and better” and I gave him what’s called the angel investment. We thought we could bring forth something unique by combining a really great AI processor and a GPU together in a way no other AI startup was doing.
But for certain reasons, I also took over the business side—operations, HR, and legal stuff. And I enjoyed that kind of work as well. In a small company, you get to do these things from scratch. You get exposed to the details of everything. It’s very refreshing. It’s a big contrast from a big company.
Q. How are AI programs different from traditional ones?
A. So, first of all, AI programs are very different from regular programs. In regular programs, there’s a serial, sequential flow, with some branches back and forth. You may have many processors, but each one runs its own threads. It’s easy for humans to read because humans write the code.
AI programs say something like this, “Take some information, represent it like an image or a very long string of sentences, and then multiply it by a very large array of numbers, and then do that a thousand times.” As you multiply by the numbers, you’re finding out the associations of that information with previously stored information in some subtle but distributed way. The process has two parts: you train the model (the set of operations is called a model), and you have an expected result.
Say, I want to complete this sentence, or I want to identify an object in a picture. When you start the model, it has no information in it. So, as you train the model, it starts to understand the relationship between the inputs and the stored information. And that’s the AI revolution.
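Stripped of all detail, the shape of such a program can be sketched in a few lines of Python—a hypothetical toy with made-up numbers, not any real model: an input vector is repeatedly multiplied by a stored array of numbers (the weights), which is the core operation an AI processor has to run fast.

```python
# Toy sketch: an "AI program" is mostly repeated multiplies by stored weights.

def matvec(weights, vec):
    """Multiply a weight matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

# One "layer" of stored information: a 2x2 weight matrix (made-up numbers).
weights = [[0.5, -0.2],
           [0.1, 0.9]]

x = [1.0, 2.0]          # input, e.g. pixel values or token embeddings
for _ in range(3):      # "do that a thousand times" -- here, just three
    x = matvec(weights, x)

print(x)
```

Training is then the process of adjusting the numbers in `weights` until the final `x` matches the expected result for each input.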
Q. Why do you feel we need to go above and beyond GPUs when it comes to AI processing?
A. The number of calculations you do in AI programs is very large. As it turns out, GPUs were better at running lots of math than regular CPUs. GPUs are actually built to run programs on pixels, which are independent. It was not a bad start and, obviously, people had real success with speeding that up.
If you actually look at the code for GPT-3—when they trained it, they used five to ten thousand GPUs in a very large cluster. That must have cost something like a hundred million dollars! Also, the program itself is probably just a thousand lines of PyTorch. So, there are more GPUs than lines of code!
And some of the lines of code say something like, “Do a matrix multiply that’s 10,000 by 10,000”—that’s a very large amount of computation. To actually run that program on 10,000 GPUs is very complicated because the GPUs don’t just collaborate like 10,000 computers in one big thing. There are multiple layers. There might be about seven to ten layers of software depending on how you define it.
Hence, something different is needed here. For example, one of the things we at Tenstorrent like to do is—you write a thousand lines of code, and we have a compiler that figures out how to break that problem up across a large number of processors. Our compiler can target from one to many chips. Right now, we’re working on the first 256 chips and we’re going to work our way up to 1000, which we think would be an interesting number for these kinds of training problems.
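The core idea of breaking one big operation across many chips can be sketched like this (a hypothetical toy, not Tenstorrent’s actual compiler): split one large matrix–vector multiply by rows, give each “chip” its own slice to compute independently, then concatenate the partial results.

```python
# Hypothetical sketch of sharding one big multiply across "chips":
# each chip gets a slice of the rows and computes its part independently.

def matvec(rows, vec):
    """Plain matrix-vector multiply over a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in rows]

def sharded_matvec(weights, vec, n_chips):
    """Split the rows of `weights` across n_chips workers, then concatenate."""
    shard = (len(weights) + n_chips - 1) // n_chips
    result = []
    for chip in range(n_chips):
        rows = weights[chip * shard:(chip + 1) * shard]
        result.extend(matvec(rows, vec))   # each "chip" works its own slice
    return result

weights = [[1, 0], [0, 1], [2, 3], [4, 5]]
vec = [10, 1]
out = sharded_matvec(weights, vec, n_chips=2)
print(out)
```

The hard part a real compiler handles is hidden here: deciding how to cut the graph, placing the pieces, and scheduling the communication between them.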
Q. What, according to you, is the right way to balance the power and performance of an AI chip?
A. Some AI models have very large sections of data. You would think making a really big RAM and putting the processing next to it would work. The problem with that is that every time you want to read the data, the access has to travel across the big RAM, which is a high-power operation.
So, the other way to do it is to take the data and break it into small pieces, and then put the processing next to the small piece. That’s how you get the power efficiency of having the data local to the processing, and not having to go so far across the chip—because a lot of power is used in moving data across the chip.
And you want the data and the processing to be local, but you also want enough data there to be interesting from a computing point of view. So, that’s one part. The other is that you want the data from one computation to go right to the next computation. You want to keep all the data on the chip and have it move through the pipeline without getting stuck, delayed, or written to memory. So, these two steps make the computation much more power-efficient.
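The two ideas—compute on small local pieces, and stream one computation’s output straight into the next—can be sketched with Python generators (a software analogy for the hardware behavior, not actual chip code): intermediate results flow stage to stage without ever being materialized in a big memory in between.

```python
# Software analogy of on-chip streaming: each stage consumes small tiles
# lazily, so intermediate results flow stage-to-stage instead of being
# written out to a big memory between computations.

def tiles(data, size):
    """Break the data into small local pieces."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def scale(stream, factor):           # first computation
    for tile in stream:
        yield [x * factor for x in tile]

def accumulate(stream):              # second computation, fed directly
    total = 0
    for tile in stream:
        total += sum(tile)
    return total

data = list(range(8))                # 0..7
result = accumulate(scale(tiles(data, size=2), factor=3))
print(result)
```

No stage ever holds more than one small tile at a time, which is the software analog of keeping data local and moving it straight down the pipeline.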
Q. Most AI systems suffer from bottleneck problems. How can one create a perfect sync between data sharing and processing?
A. Keeping most of the data on-chip would solve this issue. The bottleneck is in the processing, and not the memory. So, at the chip level, we can work around that bottleneck by keeping the data on-chip.
At the higher level, in the long run, this is going to be solved by reading data into AI models and having these AI models talk to each other, instead of re-reading lots of data over and over. Like when you learn a new thing, you don’t re-read all the stuff you’ve ever learned, right? You keep updating yourself. For example, when you add a word to a language model, it’s one word. You don’t add all the words you’ve ever learned. That’s a really interesting dynamic.
Q. What are your thoughts on open source software and hardware, especially in AI systems?
A. Intel, when they built their CPU, became the open hardware standard because they did a very good job of documenting, exposing their instruction set, and providing tools so everyone could use it. Way back, the Intel architecture was built by seven different manufacturers. People were willing to write assembly-language programs for it.
Now on GPUs, the low-level instruction set is actually somewhat difficult to use, and the GPU vendors provide all the compiler software. You can write code at a high level and then compile it down to the hardware. The GPU vendors actually change their instruction set almost every generation, so the user never sees the hardware directly. Tenstorrent is building both the hardware and the software stack. Now, we’re going to open-source that software stack. So, people, if they want to, can go down to the hardware level.
Q. Could you elaborate on a few concepts that have come up recently—like Software 2.0 and brain-like execution?
A. The big idea is that in Software 1.0, people write programs to do things; in Software 2.0, people use data to train models. For example, you can train a chess program with a billion chess moves. Or you can build a model of chess and a simulator, and then have the simulator compete with itself and slowly learn what the good moves are.
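The contrast can be illustrated with a toy example (hypothetical numbers, not from the interview): Software 1.0 hard-codes the rule y = 2x; Software 2.0 recovers the same rule from example data by adjusting a parameter to shrink the error.

```python
# Software 1.0: a programmer writes the rule directly.
def rule_v1(x):
    return 2 * x

# Software 2.0: the "rule" is a parameter learned from example data.
data = [(1, 2), (2, 4), (3, 6)]      # (input, expected output) pairs

w = 0.0                              # model y = w * x; starts knowing nothing
for _ in range(200):                 # training loop
    for x, y in data:
        error = w * x - y
        w -= 0.05 * error * x        # nudge w to shrink the error

print(round(w, 3))                   # learned weight, close to 2.0
```

The programmer never wrote “multiply by 2” in the second version; the data did.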
Where do you get the data for Software 2.0? You could get data from the simulation, from scraping the internet, etc. The data could be images, text, or scientific equations. At the hardware level, we don’t really care about where the data comes from. Pretty much no matter what, it turns into these graphs of computations.
You don’t want to fill the whole GPU with one big computation. But the way the models on GPUs are written, they essentially do the whole thing. Even in executing AI graphs, you go through the whole graph, no matter what. That’s not how your brain works; your brain does lots of small computations. If you’re thinking about animals, it fires up one part of your brain. If you’re thinking about a book, it fires up a different part. That’s called conditional execution.
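Conditional execution can be sketched as routing (a toy, mixture-of-experts-style illustration with made-up names, not any specific product): a cheap gate looks at the input and fires up only one of several expert functions, so the rest of the graph is never executed.

```python
# Toy conditional execution: a gate picks one "expert"; the rest stay idle.

calls = {"animals": 0, "books": 0}   # track which parts of the "brain" ran

def animal_expert(x):
    calls["animals"] += 1
    return f"animal:{x}"

def book_expert(x):
    calls["books"] += 1
    return f"book:{x}"

def gate(x):
    # A cheap check decides which part of the "brain" to fire up.
    return animal_expert if x in ("cat", "dog") else book_expert

def run(x):
    return gate(x)(x)                # only the chosen expert executes

run("cat")
run("dog")
run("novel")
print(calls)
```

Contrast this with running every expert on every input, which is the “go through the whole graph, no matter what” behavior described above.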
Q. Do you see AI computing becoming more accessible to the general public than before?
A. Consumer products are very successful when they’re under a thousand dollars right now. Years ago, when an American consumer went to a store to buy a TV, if it was under $500, they could just buy it. If it was over $500, they would go home, research it first, and figure out which one to buy. Right now, due to inflation, that number is about $1200.
AI computers get expensive very fast. You can end up spending $1000 over a weekend running some models! Many startups are working to make this affordable. For example, Tenstorrent’s list price plan for AI processing is to be about 5 to 10 times cheaper than the current market rate. We think that makes it more accessible. On the software side, if we can say we have a model compiling and running easily without requiring five IT people for support, it is more accessible.