How Meta is creating custom silicon for AI

By mullaned2002

October 18, 2023

With the recent launches of MTIA v1, Meta’s first-generation AI inference accelerator, and Llama 2, the next generation of Meta’s publicly available large language model, it’s clear that Meta is focused on advancing AI for a more connected world. Fueling the success of these products are world-class infrastructure teams, including Meta’s custom AI silicon team, led by Olivia Wu, a leader in the silicon industry for 30 years.

In the conversation below, Olivia explains how she led the silicon design team to deliver Meta’s AI silicon, allowing the company to improve the compute efficiency of the infrastructure, and enable software developers to create AI models that will provide more relevant content and better user experiences.

Tell us about your role at Meta.

Olivia Wu: I lead design development of the next generation of Meta’s AI silicon. My team is responsible for the design and development of Meta’s in-house machine learning (ML) accelerator, and I partner closely with our co-design, architecture, verification, implementation, emulation, validation, system, firmware, and software teams to successfully build and deploy the silicon in our data centers.

What led you to this role?

OW: I’ve been working in the silicon industry for 30 years and have experience working at a variety of large companies leading both architecture and design for multiple ASICs and IPs, and for startups focused on training AI. In 2018, I saw a social media post from Yann LeCun, our Chief AI Scientist, that Meta was looking for someone to help build AI silicon in-house. I knew of just a few other companies designing their own custom AI silicon, but they were focused mainly on silicon and not on the software ecosystem and products.

The opportunity for Meta (known as Facebook back then) was to bring in silicon developers to work directly with the software teams to reimagine end-to-end systems allowing for greater efficiency and larger degrees of freedom in optimizing across hardware and software boundaries.

This was very enticing to me. I knew this was a rare opportunity and I had to jump on it to have the chance to build a design team from the ground up.

How was the transition from working at two different startups to working at Meta?

OW: My transition from startup to Meta was super easy. We had a very small team, so it almost felt like a startup within a large company. I was able to get involved in many parts of the project. It gave me the opportunity to be very hands-on in all aspects of ASIC development.

Meta also has a very open culture. The freedom to innovate and experiment with new ideas is ingrained into Meta’s DNA. I was able to have whiteboard sessions with members of co-design, software, hardware, and other cross-functional teams to brainstorm features that would go into the silicon. These discussions gave me a lot of insights into Meta’s critical AI workloads, the challenges that our software teams had encountered with the current solutions, and their future directions. Coming from a startup, where we had very limited visibility into customer workloads and the roadmap outside of what is open sourced, this was very enlightening and refreshing.

What are some of the challenges you face in your current role?

OW: The silicon development cycle typically is fairly long. It usually spans anywhere from one and a half to two years, though it can take as long as four years in some cases. With AI advancing at a much faster clip, we are really designing hardware for software that doesn’t yet exist. So the silicon has to be able to handle not just the demands of AI today, but future AI as well. To do this, we have to understand what our software team needs – AI workload trends they see, features they will need – and incorporate that into our design.

This is where we at Meta have an advantage. Because our silicon and software teams are both in-house, we have a front row seat into what’s happening in software, and we are able to incorporate it into our silicon from the beginning.

MTIA v1 was the very first silicon that we built at Meta, so one of the really challenging things was having to build out the entire design and verification flow from scratch, as well as the silicon development infrastructure itself. This was a lot of work in the beginning, but it’s really paid off in the long run for the team.

Meta announced MTIA v1 earlier this year. What is the significance of this milestone to you and the company?

OW: MTIA v1 is Meta’s first-generation ML accelerator. It’s customized for our deep learning recommendation model, which is an important component for Meta technologies – including Facebook, Instagram, WhatsApp, Meta Quest, Horizon Worlds, and Ray-Ban Stories. While we will continue to purchase silicon chips from our partners, designing our own silicon allows us to optimize specifically for our critical workloads and gain complete control over the entire stack – from silicon, to the system, to software and the application.

This was such a fun and unique experience, especially when I first started and the team was really, really small. We were able to fit into a conference room along with the software team and whiteboard all the different ideas and features we wanted to implement. I don’t think I’ve ever had that kind of experience anywhere else. Even though the team has grown quite a bit since then, we still try to maintain that scrappy culture.

What did you and the team learn from this process?

OW: I learned how important it is to have a hands-on team capable of jumping into other roles to get the job done. We operate in many ways like a startup in that we have to wear many hats and take on other challenges beyond our usual work. So even though I’m the design lead, in addition to leading the project development, I also roll up my sleeves to code and help out wherever is needed.

What are you looking forward to next? What’s next for the AI silicon design team?

OW: AI is central to our work at Meta. The recommendation system is obviously a big part of our AI models, but beyond that, we also have GenAI and video processing use cases that have different requirements. This brings us a lot of opportunities to create products tailored for each need.

Having MTIA in-house gives us a tremendous amount of learnings that we can incorporate into our products. In addition, we have maintained the user experience and developer efficiency offered by PyTorch eager-mode development. Developer efficiency is a journey as we continue to support PyTorch 2.0, which supercharges how PyTorch operates under the hood, at the compiler level. We’re continuing to gather feedback and input from our AI software teams to shape the features of our future AI silicon.

As we work on the next generations of MTIA chips, we’re constantly looking at bottlenecks in the system, such as memory and communication across different chips so that we can put together a well-balanced solution to scale and future-proof our silicon.
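
One common way to reason about the balance between compute and memory mentioned above is the roofline model, in which attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The sketch below is illustrative only: the function names and every number in it are hypothetical and do not describe MTIA or any Meta workload.

```python
def attainable_tflops(peak_tflops: float, mem_bw_tbps: float, flops_per_byte: float) -> float:
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    return min(peak_tflops, mem_bw_tbps * flops_per_byte)


def bottleneck(peak_tflops: float, mem_bw_tbps: float, flops_per_byte: float) -> str:
    """Report which resource limits a workload with the given arithmetic intensity."""
    # The ridge point is the intensity at which the two limits meet.
    ridge = peak_tflops / mem_bw_tbps
    return "memory" if flops_per_byte < ridge else "compute"


# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_TFLOPS = 100.0
MEM_BW_TBPS = 1.0

# Hypothetical workloads, described only by FLOPs performed per byte moved.
for name, intensity in [("low-intensity lookup", 2.0), ("dense matmul", 200.0)]:
    limit = bottleneck(PEAK_TFLOPS, MEM_BW_TBPS, intensity)
    rate = attainable_tflops(PEAK_TFLOPS, MEM_BW_TBPS, intensity)
    print(f"{name}: {rate} TFLOP/s attainable, {limit}-bound")
```

Under these assumed numbers, the low-intensity workload is limited by memory bandwidth long before it reaches peak compute, which is the kind of imbalance a well-balanced design tries to avoid.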

What advice might you give to women or other historically underrepresented groups interested in pursuing a career as engineers?

OW: I would encourage them to actively participate and not shy away from speaking up in meetings or discussions so people can know what they can accomplish. The other thing is to look for mentors within the team. They don’t have to be the same as you. Having a mentor is always good, particularly early in your career, to help guide you and prioritize what will help you advance.

Meta’s Infra team, as well as Meta more widely, has a mentor program for women engineers and underrepresented people. We offer both a group coaching program as well as one-on-one coaching. I’ve done both of these and really enjoy having the opportunity to mentor. I’ve found that it’s very helpful for junior engineers to have the opportunity to get coaching and mentoring from senior people in the company.

What about Meta’s culture and technical advancements makes it such a prime time for engineers, researchers, and developers to be at the company?

OW: Meta is an amazingly open company with a truly collaborative culture, and it is a great place to learn and grow. We provide resources to help people quickly become familiar with the entire stack, even if they have no prior exposure to certain parts. This includes everything from the silicon to the firmware, the compiler, the application, as well as the large-scale system design that we are putting into the data center. The sheer scale at which Meta has been deploying the application also creates a dimension of challenges that makes it interesting and rewarding to work here.

The post How Meta is creating custom silicon for AI appeared first on Engineering at Meta.
