Monday, June 17, 2024

STARTUP: A Deep Dive into AI Processing


Tenstorrent Inc., an AI startup from Canada, aims to build a specialized software and hardware stack that can perform billions of operations per second. In an interview with EFY, Jim Keller (CTO and President of Tenstorrent) sheds light on the company's technology, his journey, and some cutting-edge concepts in AI.

Jim Keller, CTO and President of Tenstorrent

Q. What motivated you to get involved in a start-up after multiple roles in corporate companies? How did you join Tenstorrent?

A. I’m the CTO and President at Tenstorrent, and I was also their first investor. Ljubisa Bajic started the company. He called me and said, “Hey, we have this new idea to do an AI processor that’s different and better,” and I gave him what’s called an angel investment – a small investment, enough to support him and two other guys working for six months to a year on a proof of concept.

I had worked with him before at AMD. He had a unique combination of knowledge of chip design – he knew how GPUs worked – and he was a very good programmer, a good mathematician. He was one of the few people I met who had knowledge of four different domains. He understood what the AI algorithms were doing and could translate that to software. And, he actually knew enough about chip design that I thought he could do it.

I was at Tesla at that time and a whole bunch of AI startups were coming and trying to pitch Tesla on their AI stuff. Then I went to Intel, which was one of the challenges of a lifetime – a team of 10,000 people! When I left Intel, I thought about starting a new company from scratch but the AI revolution had already started. So I joined the company (Tenstorrent). We thought we could bring forth something unique by combining a really great AI processor and a GPU together in a way no other AI startup was doing.


“Today the AI revolution is big and we’re going to bring something interesting with the combination of GPUs and AI processors”

But for certain reasons, I took over the business side – operations, HR and legal stuff. And I enjoyed that kind of work as well. In a small company, you get to do these things from scratch. It’s very refreshing! It’s a big contrast from a big company.

Q. How are AI programs different from traditional ones?

A. First of all, AI programs are very different from regular programs. In regular programs, there’s a serial or sequential flow. You have some branches back and forth, and you may have many processors, but each one is running threads. It’s easy for humans to read because humans write the code.

AI programs say something like this: “Take some information, represent it as an image or a very long string of sentences, then multiply it by a very large array of numbers, and do that a thousand times”. As you multiply by the numbers, you’re finding the associations of that information with previously stored information in some subtle but distributed way. It goes through two steps: you train the model (the set of operations is called a model), and you have an expected result.

Say, I want to complete this sentence or I want to identify an object in a picture. When you start the model, it has no information in it. So, as you train the model, it starts to understand the relationship between the inputs and the stored information. And that’s the AI revolution.
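The training loop Keller describes can be sketched in a few lines of Python, as a minimal illustration only – a single weight matrix standing in for a full model, and nothing here is Tenstorrent code:

```python
import numpy as np

# A single weight matrix stands in for a full model; real models chain
# many such multiplies, but the training loop has the same shape.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(4, 3))      # the relationship we want the model to learn
W = np.zeros((4, 3))                  # the model starts with no information in it

X = rng.normal(size=(100, 4))         # inputs (stand-ins for images, sentences, ...)
Y = X @ W_true                        # expected results used for training

for step in range(200):
    pred = X @ W                      # multiply the input by the array of numbers
    grad = X.T @ (pred - Y) / len(X)  # how wrong we were, per weight
    W -= 0.1 * grad                   # nudge the stored numbers toward the targets

loss = float(np.mean((X @ W - Y) ** 2))
```

Real models chain thousands of such multiplies, but the shape of the loop – multiply, compare with the expected result, adjust the stored numbers – is the same.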

Q. How would you explain Tenstorrent to a CXO who doesn’t have a technical background?

A. The number of calculations you do in AI programs is very large. As it turns out, GPUs were better at running lots of math than regular CPUs. GPUs are actually built to run programs on pixels, which are independent. It was not a bad start and obviously, people had real success with speeding that up.

Tenstorrent’s idea is this: there’s a GPU and there’s an AI program, and people figured out how to write software to connect the two. We know what AI programs look like – programmers write them in TensorFlow and PyTorch. Then we asked ourselves, “What are the basic operations this is doing? Let’s make those as easy as possible to run”.

We’re not emulating the AI program; we’re running the AI program very close to how it’s written. We think that gives us efficiency. We think it gives us a better path to write new AI programs and compile them onto the hardware. So that’s our premise.

We think it also gives us a platform to explore new models because some of the current AI programs have been optimized to run well on GPUs or CPUs, but it’s causing limitations in the programs. Our mission is to run the current programs really well, but also create a platform that lets you write new AI programs that run very fast.

Q. What is Tenstorrent working on currently?

A. If you actually look at the code for GPT-3 – when they trained it, they used five to ten thousand GPUs in a very large cluster. That must have cost something like a hundred million dollars! Also, the program itself is probably just a thousand lines of PyTorch. So there are more GPUs than lines of code!

And some of the lines of code say something like “do a matrix multiply that’s 10,000 by 10,000” – that’s a very large amount of computation. Actually running that program on 10,000 GPUs is very complicated, because the GPUs don’t just collaborate like 10,000 computers in one big thing. There are multiple layers – about seven to ten layers of software, depending on how you define it.
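As a back-of-the-envelope check of why one such line of code is so expensive (our arithmetic, not from the interview):

```python
# One 10,000 x 10,000 matrix multiply: each of the 10,000 x 10,000 output
# elements needs 10,000 multiplies and 10,000 adds.
n = 10_000
flops = 2 * n**3
print(f"{flops:.1e} FLOPs")  # 2.0e+12 operations for a single line of PyTorch
```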

One of the things we like to do is this: you write a thousand lines of code, and we have a compiler that figures out how to break that problem up across a large number of processors. Our compiler can target anywhere from one chip to many. Right now we’re working on the first 256 chips, and we’re going to work our way up to 1000, which we think would be an interesting number for these kinds of training problems.

“We have some good prototype data about how this works and that’s the thing we’re working on now”

Q. Are we correct in understanding that about 256 chips have been produced so far?

A. We have two generations of the product. The first one is called GraySkull. Our first-generation part has a PCI Express Interface and plugs into a server (like an AMD or an Intel server). We can plug in 8 cards with two chips each. We’ve produced a couple thousand and we’re in production.

Our second-generation part is very similar in terms of math and AI. It has some improvements – it has sixteen 100-gigabit Ethernet ports on it, so we can hook the chips to each other in a mesh. Think of a chip with four Ethernet ports on each side. You can hook them up in something like a 32 by 32 chip mesh. That’s the product for which we now have the first 256. The challenge is how many of them we can hook together and then use our simple software stack to compile across many chips.

Q. What is meant by the network communication hardware that’s present in each of Tenstorrent’s processors?

A. Our AI processors are called Tensix, and between each is a connection we call the NoC (Network-On-Chip). When you go from chip to chip, there’s an ethernet connection there, which is a very different electrical transport. But we have a logical layer that simply says – from processor to processor, connect a pipe between them and push the data through. So at the low-level software, you just see Tensix processors and pipes. Then there’s a layer that we wrote underneath that decides whether it’s going over the NoC or the ethernet between the chips, and we don’t expose that to the programmer. That’s just handled.
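The pipe abstraction Keller describes might be sketched like this in Python – purely illustrative; the class and function names are our own, not Tenstorrent’s actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Core:
    chip: int  # which chip in the mesh
    x: int     # core position on that chip's NoC grid
    y: int

def transport(src: Core, dst: Core) -> str:
    """Decide which physical link carries a pipe; hidden from the programmer."""
    return "noc" if src.chip == dst.chip else "ethernet"

def make_pipe(src: Core, dst: Core) -> dict:
    # Low-level software just sees "a pipe"; the transport choice is handled
    # by the layer underneath.
    return {"src": src, "dst": dst, "link": transport(src, dst)}
```

The point of the design is that `transport` lives below the level the programmer sees: code built on `make_pipe` looks identical whether the data crosses the on-chip NoC or an Ethernet link between chips.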

Q. How did you build your first prototype? What hardware was used in it?

A. The first thing they did was write the model and software, and then put it on an FPGA. So when I made the investment, they were able to run that FPGA at 100MHz and demonstrate the capability of the AI processor and some of the software. For the second prototype, they basically put it on what’s called a “shuttle wafer”, which is a very cost-effective way to build a prototype. That first chip was called Jawbridge; it was very small, but it demonstrated all the capabilities of the NoC, the Tensix core, and the first-generation software stack.

Q. If we look at GraySkull, we can see that you talk about both hardware and software. What made you pick both?

A. Intel, when they built their CPU, became the open hardware standard because they did a very good job of documenting, exposing their instruction set, and providing tools so everyone could use it. Way back, Intel architecture was built by seven different manufacturers. People were willing to write Assembly Language programs for that.

Now on GPUs, the low-level instruction set is actually somewhat difficult to use, and the GPU vendors provide all the compiler software. You can write code at a high level and then it compiles through the hardware. The GPU vendors actually change their instruction set almost every generation, so the user never sees the hardware directly.

We’re building both the hardware and the software stack. Now we’re going to open-source that software stack, so people, if they want to, can go down to the hardware level. But most people want a software stack that lets them program in a higher-level language. We’re trying to enable them without getting in the way of the hardware.

Q. How have you balanced the power and performance of your chip?

A. Some AI models have very large sections of data. You would think making a really big RAM and putting the processing next to it would work. The problem with that is that every time you want to read the data, it has to read across the big RAM, which is a high-power process.

So the other way to do it is to take the data and break it into small pieces, and then put the processing next to the small piece. That’s how you get the power efficiency of having the data local to the processing, and not having to go so far across the chip – because a lot of power is used in moving data across the chip.

“There’s a sweet spot – you don’t want RAM to be too big or too small”

And you want the data and the processing to be local, but you also want enough data there to be interesting from a computing point of view. So that’s one part. The other is that when you lay out the graph, you want the data from one computation to go right to the next computation. You want to keep all the data on the chip and have it move through the pipeline without getting stuck, delayed, or written to memory. These two steps make the computation much more power-efficient.
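The “break the data into small pieces” idea is the classic tiling (blocking) pattern. Here is a minimal Python sketch using a tiled matrix multiply as the stand-in computation – illustrative only; Tenstorrent does this in silicon, not in Python loops:

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Multiply square matrices block by block, so each small tile of data
    stays 'local' to the processing working on it."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                # Each tile-sized block is the local unit of work
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C
```

The result is identical to the untiled multiply; what changes is the access pattern – each processor touches a small piece of data sitting next to it instead of streaming across one big RAM.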

Q. Most AI systems suffer from bottleneck problems. How are you able to create a perfect sync between data sharing and processing?

A. We believe we can keep most of the data on-chip. The bottleneck is in the processing, and not the memory. We have a very large number of network connections. So at the chip level, we’re working around that bottleneck by keeping the data on-chip and moving through the graph.

At the higher level, in the long run, this is going to be solved by reading data into AI models and having these AI models talk to each other, instead of re-reading lots of data over and over. Like when you learn a new thing, you don’t re-read all the stuff you’ve ever learned, right? You keep updating yourself. For example, when you add a word to a language model, it’s one word. You don’t re-add all the words you’ve ever learned. That’s a really interesting dynamic.

Q. Could you elaborate on Tenstorrent’s work in the Cloud Computing domain?

A. A lot of the people we talked to are very used to using cloud computers. We have a development cloud, so anyone can log in and try it out before buying. And we realized that if someone likes it, they should be able to rent GraySkull chips too. There was a lot of research involved, and we spent some time figuring that out. We realized that the hands-on experience of building our own cloud computer – we’re building one right now with a thousand GraySkull chips – was really useful.

We learned so much about the cloud coordination software, the network stack, and the file system. Now that it’s up and running, we’re testing it with a few people we have onboarded. Overall, the mission is that we provide this foundation, so somebody can come in with a model and say, “I want to scale this up and compile it and not have to get into the dirty details”. That’s our job.

Q. Your website suggests that Tenstorrent is building a computing platform for AI and Software 2.0. Could you elaborate on that as well as your computation process?

A. The big idea is, in Software 1.0, people write programs to do things. For 2.0, people use data to train models. For example, you can train a chess program with a billion chess moves. Or, you can build a model of chess and a simulator, and then have the simulator compete with itself and slowly learn what the good moves are.

Where do you get the data for Software 2.0? You could get data from the simulation, from scraping the internet, etc. The data could be images, text, or scientific equations. At the hardware level, we don’t really care about where the data comes from. Pretty much no matter what, it turns into these graphs of computations.

You don’t want to fill the whole GPU with one big computation. But the way the models on GPUs are written, they essentially do the whole thing. Even in executing AI graphs, you go through the whole graph no matter what. That’s not how your brain works – your brain has lots of small computations. If you’re thinking about animals, it fires up one part of your brain. If you’re thinking about a book, it fires up a different part. That’s called conditional execution.

Using our technology, you can build smaller units of computation. You could have a very efficient graph where the computations are small. What the first couple of layers of the network do is find out what the request is about and then direct the computation to one place or another. That would look like a random activation of the network. So we can do arrays of small computations, and we can also do a kind of computation where you compute only certain parts of your graph.
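Conditional execution as described here resembles mixture-of-experts routing. A hedged sketch with random placeholder weights – not a trained model, and not Tenstorrent code:

```python
import numpy as np

# A small routing layer looks at the input and activates only one of
# several sub-networks, instead of running the whole graph every time.
rng = np.random.default_rng(1)
experts = [rng.normal(size=(8, 8)) for _ in range(4)]  # small sub-computations
router = rng.normal(size=(8, 4))                       # first layers: pick a path

def forward(x):
    scores = x @ router
    choice = int(np.argmax(scores))  # direct the computation to one place
    return experts[choice] @ x       # only that expert's math actually runs

x = rng.normal(size=8)
y = forward(x)
```

For each request, three of the four experts never execute – that is the “random activation of the network” Keller describes, and it is what makes the graph cheap relative to running everything.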

Q. How do you manufacture your products? Are there any partnerships and collaborations regarding the same?

A. We designed a chip, wrote the RTL code and did the high-level architecture. We partnered with a company from Hyderabad called INVECAS Inc. They did the backend, and handled the chip layout and package design. Then we manufactured the board with a contract manufacturer from Canada – they have operations in Canada and Indonesia. We have contracts with test people and shipping people too.

The supply chain is amazing, it is both good and bad. For example, there might be a shortage of DRAMs or some accident in the packaging house that can cause a shortage of chips. As a start-up, you get exposed to the details of everything! Our weekly supply chain meetings always start with “I have some good news and some bad news!” We’ve actually gotten better at it by building relationships.

Q. What is your commercialization story? Are your products in their proof of concept stage, or is there a commercial rollout?

A. I’ll give you our software story, which is also true for everybody in AI. In AI startups, when you first have your big idea, you get a couple of benchmarks, and then your genius programmers get good results. So we did that. Then you do a version or two of your software which attempts to take real programs and compile them on the hardware.

While these programs and graphs look fairly straightforward, there are actually a lot of complexities in those graphs. So we built something we call the “point nine” software stack, which handles many common models. We onboarded a number of customers and did proofs of concept with them to validate the hardware and the software stack. We also figured out a bunch of things to make the software stack even better and decided to do a rewrite of it, which we’re pretty happy with.

From a business point of view, we have watched some AI startups win over Microsoft or other big companies, but their hardware and software weren’t ready for big, complicated enterprise stuff. We decided that we really wanted to focus on AI startups and small researchers. We wanted to find a hundred small customers, not ten big ones.

We’ve talked to various people – medical imaging people, biotech people – and some of them have very interesting models that are different. It turns out, lots of companies also have their own research groups – they own their own models and data. It’s our business plan to focus on them.

Q. How would you define the company’s products? When we go to your products menu, the one-line description seems to be related to chips, but there is mention of boards, a software stack, and servers as well.

A. We’ve designed chips and then built compiler software on top of them. So that’s our technology. When it comes to buying it, it depends on the customer – some customers say they have a bunch of servers already, with GPUs, and they would like them to run faster, so they prefer to buy our boards to use alongside their GPUs. Other people have servers already, but they want to buy more with our cards in them. Certain AI startups working from home use our cloud.

Our chip is in each one of those three delivery methods, and we decided we’re going to price it so that it’s basically the same cost for AI anyway – it’s relatively price-neutral. That’s why the website has a variety of types of products. We’re not quite sure what the best way to sell this is, and we don’t want to tell people how to buy it. We want them to tell us what they want!

Q. Do you aim to make AI computing more accessible to the general public, just like Microsoft made PCs available to everyone, not just enterprises?

A. Consumer products are very successful when they’re under a thousand dollars right now. Years ago, when an American consumer went to a store to buy a TV, if it was under $500, they could just buy it. If it was over $500, they would go home, research it first, and figure out which one to buy. Right now, due to inflation, that number is about $1200.

“AI is moving really fast right now. Your phone has a CPU, GPU, camera processor, and an AI processor too. And in each generation, it’s getting bigger”

So AI computers get expensive, very fast. You can end up spending $1000 over a weekend running some models! Our list price plan is to be about 5 to 10 times cheaper than the Nvidia solution. We think that makes it more accessible. On the software side, if we can say we have a model compiling and running easily without requiring five IT people for support, it is more accessible.

Q. What do hiring trends look like for Tenstorrent?

A. I tell my team to look for people who are very hands-on. I’m interested in doers, people who can show me a bunch of code or a performance model or a new test bench. The real researchers are the ones who are really skilled at doing things. They’ve written lots of code. They have worked on problems and have hit a point where they can’t get it done and have found a solution. That’s really exciting.

We’re looking for people who come in and tell us what they’ve done and whether that fits with what we’re doing. I like engineers who go off with a problem and come back pleasantly surprised or even unpleasantly surprised! People who say, “This will never work, we should do it like this instead”.

“There’s a really big difference between getting A’s on tests and designing something new”

Also, engineers who go through school and then do an internship or two – that’s really useful because they learn stuff at school, and in an internship, they do real practical stuff. In my career, we’ve hired lots of fresh graduates. Sometimes that works out great. Sometimes we need the right kind of work for them to do, but if people are creative and they want to work hard, I don’t care what their qualifications are. It’s really about their hands-on skills.

Transcribed by Aaryaa Padhyegurjar

