- Participate in the development of the MLOps/AIOps machine learning platform and in open source communities
- Take responsibility for foundational research and product development, and continuously improve R&D efficiency
- Take responsibility for feature development and algorithm optimization of the platform, improving user experience and usability through cutting-edge or mature technologies
- Participate in or lead design reviews with peers and stakeholders to decide among available technologies.
- Review code developed by other developers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
- Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback.
- Bachelor’s degree or equivalent practical experience in computer science or related areas.
- 2 years of experience with data structures or algorithms in either an academic or industry setting.
- Good communication and writing skills in an English-speaking environment.
- Proficient in Python and familiar with typical deep learning frameworks (TensorFlow/PyTorch) and models such as CNN, Transformer, GBDT, and LR.
- Experience in developing MLOps features, including workflow orchestration, model training, model serving, monitoring/observability, versioning (of data, code, and models), data pipelines, logging, etc.
- Familiar with communication backends (MPI, NCCL, RPC, MQTT), GPU CUDA, and other core modules of deep learning frameworks; candidates who have contributed to specific modules of well-known deep learning frameworks are preferred
- Experience with federated learning and distributed training of large-scale models is preferred
- Combine the platform and the open source library to improve the end-to-end training efficiency of deep learning through task scheduling, elastic disaster recovery, performance optimization, and other measures, involving Kubernetes (K8s)/Kubeflow, network optimization, and distributed training
About the job
FedML, Inc. (https://fedml.ai) aims to provide an end-to-end machine learning operating system that lets people and organizations transform their data into intelligence with minimal effort. FedML stands for “Fundamental Ecosystem Development/Design for Machine Learning” in a broad scope, and “Federated Machine Learning” in a narrow scope. At the current stage, FedML is developing and maintaining a machine learning platform that enables zero-code, lightweight, cross-platform, and provably secure federated learning and analytics. It enables machine learning from decentralized data at various users/silos/edge nodes, without the need to centralize any data in the cloud, hence providing maximum privacy and efficiency. It consists of a lightweight and cross-platform Edge AI SDK that is deployable over edge GPUs, smartphones, and IoT devices. It also provides a user-friendly MLOps platform to simplify decentralized machine learning and real-world deployment. FedML supports vertical solutions across a broad range of industries (healthcare, finance, insurance, smart cities, IoT, etc.) and applications (computer vision, natural language processing, data mining, and time-series forecasting). Its core technology is backed by more than 3 years of cutting-edge research by its co-founders, who are recognized leaders in the federated machine learning community. FedML recently raised around 2 million USD to scale up its product and engineering team.
FedML’s researchers and software engineers develop the next-generation platform for machine learning and artificial intelligence. We’re looking for researchers or engineers who bring fresh ideas from all areas, including machine learning and its applications in computer vision, natural language processing, and data mining, as well as large-scale system design and implementation for distributed/cloud computing systems, security/privacy, mobile/IoT systems, networking, and Web UI design and development. As a software engineer, you will work on a specific project critical to our customers’ needs. We expect our engineers to be fast learners who are enthusiastic about tackling new problems as we continue to push technology forward. You will design, develop, test, deploy, maintain, and enhance software solutions.
CEO – Salman Avestimehr
Salman Avestimehr is a world-renowned expert in federated learning with more than 20 years of R&D leadership in both academia and industry. He has been a Dean’s Professor and the inaugural director of the USC-Amazon Center on Trustworthy Machine Learning at the University of Southern California. He has also been an Amazon Scholar at Amazon. He is a United States Presidential award winner for his profound contributions in information technology, and a Fellow of IEEE. https://www.avestimehr.com/
CTO – Chaoyang He
Chaoyang He holds a PhD from the CS department at the University of Southern California, Los Angeles, USA. He has research experience in distributed/federated machine learning algorithms, systems, and applications, and has published papers at top-tier conferences such as ICML, NeurIPS, CVPR, ICLR, AAAI, and MLSys. He also has rich industry experience in the areas of distributed/cloud computing and mobile/IoT systems. He was an R&D Team Manager and Principal Software Engineer at Tencent, and also worked as a researcher/engineer at Google, Facebook, Amazon, Baidu, and Huawei. He has received a number of awards in academia and industry. Homepage: https://chaoyanghe.com
We prefer candidates based in California, but we also support offices in other states or working from home if necessary.