Cerebras Systems announced an AI inference solution for developers on Tuesday, which the company claims is up to 20 times faster than Nvidia's offerings.
Cerebras will provide access to its larger chips to run AI applications, which, according to the company, are also cheaper than Nvidia GPUs. The industry-standard Nvidia GPUs are often accessed through cloud service providers to run large language models such as ChatGPT, but that access is often difficult and expensive for smaller firms.
Cerebras claims its new chips deliver performance beyond GPUs
AI inference is the process of running an already trained AI model to produce an output, such as a chatbot's answers or the solution to a given task. Inference services are the backbone of today's AI applications, which rely on them for day-to-day operations serving users.
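To make the distinction concrete, here is a toy sketch of inference as described above: applying a model's already-trained parameters to new input to produce an output. The weights and inputs are invented for illustration; no real model or framework is implied.

```python
# Toy illustration of inference: a forward pass with frozen, "trained"
# parameters. In training these weights would be learned; at inference
# time they are fixed and only applied to new inputs.
weights = [0.8, -0.3]  # hypothetical trained parameters
bias = 0.1

def predict(features):
    """Run inference: weighted sum of inputs plus bias, parameters frozen."""
    return sum(w * x for w, x in zip(weights, features)) + bias

result = predict([1.0, 2.0])  # 0.8*1.0 + (-0.3)*2.0 + 0.1
```

Real LLM inference is vastly larger (billions of parameters, repeated token by token), which is why throughput in tokens per second is the metric chip vendors compete on.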
Cerebras said inference is the fastest-growing segment of the AI industry, accounting for 40% of all AI-related workloads in cloud computing. Cerebras CEO Andrew Feldman told Reuters in an interview that the company's outsized chips deliver a level of performance that GPUs cannot achieve.
He added, "We're doing it at the highest accuracy, and we're offering it at the lowest price," according to Reuters.
The CEO said that existing AI inference services do not satisfy all customers. He told a separate group of reporters in San Francisco that the company is "seeing all sorts of interest" in faster, more cost-effective solutions.
Until now, Nvidia has dominated the AI computing market with its gold-standard chips and Compute Unified Device Architecture (CUDA) programming environment, which has helped lock developers into its ecosystem by providing a vast array of tools.
Cerebras chips have 7,000 times more memory than Nvidia H100 GPUs
Cerebras said its high-speed inference service is a turning point for the AI industry. The company's new chips, which are as big as dinner plates, are called Wafer Scale Engines. They can process 1,000 tokens per second, a leap the company likened to the introduction of broadband internet.
According to the company, throughput varies by model: the new chips can process as many as 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B.
Cerebras is offering inference services at 10 cents per one million tokens, below the price of GPU-based offerings. Faster approaches usually compromise accuracy for performance, according to industry beliefs, but Cerebras claims its new chips maintain accuracy.
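The throughput and pricing figures above translate directly into latency and cost per request. The following back-of-the-envelope sketch uses only the numbers quoted in the article (1,800 and 450 tokens per second, $0.10 per million tokens); the request size is an invented example.

```python
# Back-of-the-envelope latency and cost using the article's figures.
PRICE_PER_MILLION_TOKENS = 0.10  # dollars, as quoted for Cerebras

def generation_time_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate num_tokens at a given throughput."""
    return num_tokens / tokens_per_second

def cost_dollars(num_tokens: int,
                 price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Cost of generating num_tokens at a flat per-million-token price."""
    return num_tokens * price_per_million / 1_000_000

# A hypothetical 500-token answer on Llama 3.1 70B at 450 tokens/second:
latency = generation_time_seconds(500, 450)  # roughly 1.1 seconds
price = cost_dollars(500)                    # 500 tokens at $0.10/million
```

At these rates, a full million tokens costs exactly the quoted 10 cents, which is why per-million-token pricing is the standard unit of comparison between inference providers.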
Cerebras said it will offer AI inference products in several forms. The company plans to introduce an inference service via its cloud and a developer key, and will also sell the new chips to data center customers and those who want to operate their own systems.
The new Wafer Scale Engine chips have their own integrated cooling and power delivery modules and come as a part of a Cerebras data center system called the CS-3. According to different reports, the Cerebras CS-3 system is the backbone of the company’s inference service.
The system boasts 7,000 times more memory capacity than Nvidia H100 GPUs. This also addresses the fundamental problem of memory bandwidth, which many chipmakers are trying to solve.
Cerebras is also working toward becoming a publicly traded company, having filed a confidential prospectus with the Securities and Exchange Commission (SEC) this month.