
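Hiring one DevOps engineer, focused on GPU clusters. The role is based in Shanghai and full-time, and requires spoken English good enough for everyday communication without difficulty. The position is in my department. If you're interested, add me on WeChat: jiaszwx, and you can talk directly with the hiring manager.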


Full JD below.

Company: Bosch (China) Investment Ltd. (博世(中国)投资有限公司)

【Position JD】
Company Description
Do you want beneficial technologies being shaped by your ideas? Whether in the areas of mobility solutions, consumer goods, industrial technology or energy and building technology - with us, you will have the chance to improve quality of life all across the globe. Welcome to Bosch.
Job Description
Working in an international DevOps team, you will be responsible for the operation and development of the GPU cluster for the AI Deep Learning Platform.
Development of additional features for the service, such as rolling out new software and implementing new cluster interfaces (e.g. RESTful API, load balancing)
Implementation of performance monitoring (e.g. dashboards)
Automation & Deployment (e.g. patch management, integration of new compute nodes into cluster)
Preparation and execution of maintenance for all clusters, e.g. security updates, compatibility testing and rollout.
Resolution of user incidents via various channels, e.g. issues with GPU devices or the scheduling system, and user issues in cluster usage (e.g. access, compute jobs, software management).
Software deployment and maintenance (e.g. new versions)
Sysadmin housekeeping tasks (config cleanup, etc.)
Build, expand, maintain knowledge base
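As a member of Bosch's global GPU cluster DevOps team, you will be responsible for the ongoing development and operation of the GPU cluster that serves as the AI deep learning platform.
Develop new feature modules on top of the platform's existing services (e.g. RESTful API, load balancing).
Develop platform features such as performance monitoring (e.g. dashboards).
Automated deployment (e.g. package management, integrating newly added compute nodes into the cluster).
Operation and maintenance of Bosch's GPU compute clusters worldwide, e.g. security updates, compatibility testing, capacity expansion.
Support users through various channels and resolve issues as they arise, e.g. problems with GPU devices, problems with job scheduling, and other problems users may run into when using the cluster (access, compute jobs, software management, ...).
Development and maintenance of software packages, e.g. new version releases.
Day-to-day system administrator tasks, e.g. refreshing configuration items.
Build, and continuously maintain and expand, a shared knowledge base.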
【Qualifications】
Major in Computer Science, Mathematics, Engineering, or a relevant technical discipline (bachelor's or master's degree)
3+ years of hands-on experience with Linux and DevOps.
Deep knowledge of general Linux server administration, such as Linux system management, networking, security and container technologies.
Python for operations automation, Kubernetes and Docker for scheduling, MLflow deployment
Software development experience is a plus.
Know-how in the GPU computing domain (CUDA, cuDNN, NCCL, TensorFlow, PyTorch, CST, etc.)
Basic know-how of networking and GPU hardware (network, performance, models)
Good teamwork and cooperation with a global team
Quick learner of new data technologies
English (listening, speaking, reading, writing).
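Degree in computer science, mathematics, engineering, or a related technical discipline (bachelor's or master's).
3+ years of hands-on experience with Linux and DevOps.
Proficient in Linux server administration, with in-depth knowledge of related areas such as Linux system management, networking, security, and container technologies.
Familiar with Python scripting for operations automation; experience with Kubernetes and Docker, or with MLflow.
Knowledge of the GPU computing domain is a plus, e.g. CUDA, cuDNN, NCCL, TensorFlow, PyTorch, CST.
Basic knowledge of networking and GPU hardware, such as network and hardware performance.
Able to work well with a global team.
Able to quickly learn and master new data technologies.
English (listening, speaking, reading, writing).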