AI Platform Operations Manager

  • AI specialist
  • San Jose
  • 1 month ago
  • Full Time

About the job

Our Company

Changing the world through digital experiences is what Adobe’s all about. We give everyone—from emerging artists to global brands—everything they need to design and deliver exceptional digital experiences! We’re passionate about empowering people to create beautiful and powerful images, videos, and apps, and transform how companies interact with customers across every screen.

We’re on a mission to hire the very best and are committed to creating exceptional employee experiences where everyone is respected and has access to equal opportunity. We realize that new ideas can come from everywhere in the organization, and we know the next big idea could be yours!

We are looking for an experienced AI Platform Operations Manager to lead Operational Excellence for the Adobe AI Platform and its underlying infrastructure. This is an outstanding opportunity to solve hard operational challenges in the AI age, including platform efficiency, observability, and scalability across training, inference, and data platforms. You will engage regularly with Platform users, including Research and Applied ML engineering teams, as well as senior leadership to understand these challenges. You will develop policies and help enforce them in close collaboration with platform engineering and SRE. The role requires you to coordinate cross-functionally with Adobe’s Finance, Product and Program teams, as well as with our vendor partners, to ensure seamless and efficient end-to-end performance of the platform.

 Ideal candidate for this position: 

 Proven experience driving business operations in a deeply technical environment such as an AI platform, cloud infrastructure, or other large-scale distributed computing environments.

 Proficient in data pipelines, BI tools for dashboards, observability tools (e.g., Prometheus, Grafana), and incident tracking systems.

 Comfortable working closely with large CSPs such as AWS, Azure and GCP, specifically on their Kubernetes products such as EKS / AKS / GKE.

 Solid understanding of the ML lifecycle. 

 Demonstrated success in achieving goals through fostering positive relationships with diverse collaborators in multi-functional settings. 

 What you’ll do: 

 Actively collaborate with Platform Users, SRE, Platform Engineering, Finance, Adobe Product teams as well as Adobe’s Cloud Service Providers to plan and manage GPU capacity for AI Training workloads and Inference deployments.

 Drive and own day-to-day operations of AI platforms in close partnership with globally dispersed platform engineering and SRE teams. 

 Collaborate multi-functionally to gather insights into platform usage patterns and user feedback to develop effective platform policies, improve user support, and influence the platform roadmap. 

 Deliver key capacity scaling, optimization and cost efficiency goals. 

 Track and drive operational improvements for key metrics including GPU utilization, waste elimination, and platform user satisfaction.

 What you will need to succeed: 

 10+ years’ experience in Engineering operations/program management or a related field.

 BS or MS in Computer Science or related program. 

 Proven technical leadership and the ability to create alignment among subject matter experts.

 Deep understanding of cloud platforms including compute, networking, storage, monitoring, and automation.

 Proven success leading large-scale, complex programs involving cross-functional and geographically distributed teams. 

 Strong knowledge of DevOps practices, CI/CD pipelines, infrastructure as code and service reliability. 

 Strong influencing and interpersonal skills, including relationship building and collaboration within diverse, cross-functional teams. 

 Strong technical aptitude in software/system design and development methodologies (including Agile).

 An intrinsic ability to deal with ambiguity, with a flexible and adaptive approach.

 Analytical approach to problem-solving, attention to detail, and organizational skills, combined with the ability to synthesize and communicate strategic insights.

 Proactive and adaptable demeanor, with a “no task is too big or too small” approach to problem-solving and execution. Requires minimal direction in an ambiguous context to take action and adapt quickly. 

 Acute degree of ownership and grit: you do not let go until a problem is solved for good.