Sr Software Engineer - Data Engineer / Python


Date: Nov 16, 2023

Location: Pune, MH, IN

Company: Houghton Mifflin Harcourt

HMH India
Houghton Mifflin Harcourt (HMH) is a learning technology company committed to delivering
connected solutions that engage learners, empower educators and improve student
outcomes. As a leading provider of K–12 core curriculum, supplemental and intervention
solutions, and professional learning services, HMH partners with educators and school
districts to uncover solutions that unlock students’ potential and extend teachers’ capabilities.
HMH serves more than 50 million students and 4 million educators in 150 countries.
HMH Technology India Pvt. Ltd. is our technology and innovation arm in India focused on
developing novel products and solutions using cutting-edge technology to better serve our
clients globally. HMH aims to help employees grow as people, and not just as professionals.


For more information, visit


Job Description: Senior Software Engineer - Data Engineer / Python


Position Overview:


We are seeking an accomplished Senior Machine Learning Data Engineer to join our innovative team, working in synergy with our Large Language Model (LLM) Software Architects and Principal Engineers. This pivotal role focuses on harnessing structured and unstructured data, conducting advanced data analysis, and building efficient pipelines for training machine learning models, including LLMs. Your expertise will drive the development of cutting-edge language processing solutions.


Key Responsibilities:


  • Design, develop, and optimize end-to-end data pipelines for collecting, processing, and preparing diverse datasets, including structured and unstructured data sources.
  • Perform in-depth data analysis, exploration, and transformation to ensure data quality and suitability for training machine learning models. Employ advanced techniques to handle unstructured text data effectively.
  • Collaborate with machine learning researchers and engineers to create meaningful features from complex data, enhancing the performance of language models and related applications.
  • Source, curate, and integrate external datasets to enrich the training and evaluation of machine learning models.
  • Work closely with ML researchers to shape the data to fit model requirements, ensuring optimal model performance and generalization.
  • Develop data processing architectures that scale to handle large volumes of data efficiently, ensuring consistent and reliable model training.
  • Interface with machine learning model training pipelines, integrating data preprocessing steps seamlessly to support model development and iteration.
  • Collaborate with cross-functional teams, including software architects, machine learning researchers, and data scientists, to understand requirements and align data engineering efforts.
  • Establish and enforce data quality standards, data governance best practices, and data privacy considerations in accordance with regulatory guidelines.
  • Create and maintain comprehensive documentation detailing data sources, processing steps, and transformation logic.
  • Identify and address performance bottlenecks in data processing pipelines, optimizing for speed, efficiency, and reliability.
  • Stay current with the latest advancements in data engineering, machine learning, and natural language processing domains.




Qualifications:


  • Bachelor's or Master's in Computer Science, Data Science, Mathematics, Engineering, or related field.
  • 5+ years of hands-on experience in data engineering, with a focus on structured and unstructured data processing for AI/ML needs.
  • Strong understanding of machine learning concepts and their application to data engineering for model training.
  • Proficiency in designing and building data pipelines using tools like Prefect, Apache Spark, Apache Airflow, TensorFlow Data Pipeline, Athena or similar frameworks.
  • Familiarity with MLOps using MLflow and/or SageMaker, Docker, K8s, CI/CD, etc.
  • Expertise in programming languages such as Python and SQL. Familiarity with Scala and scripting languages like Bash is advantageous.
  • Proven ability to analyze and transform complex structured and unstructured data into usable formats for machine learning.
  • Familiarity with data storage technologies such as Snowflake, Amazon Redshift, S3, and PostgreSQL.
  • Familiarity with data integration tools (such as Informatica) and ELT/ETL techniques to blend data from various sources.
  • Knowledge of cloud platforms (AWS, Azure, GCP) and related services for scalable data processing.
  • Excellent teamwork and communication skills to collaborate effectively with cross-functional teams.
  • Strong analytical skills to address complex data engineering challenges.
  • Ability to adapt to evolving technologies and business needs in a fast-paced environment.


Houghton Mifflin Harcourt Technology Private Limited is an Equal Opportunity Employer and
considers applicants for all positions without regard to race, colour, religion or belief, sex, age,
national origin, citizenship status, marital status, military/veteran status, genetic information,
sexual orientation, gender identity, physical or mental disability or any other characteristic
protected by applicable laws. We are committed to creating a dynamic work environment that
values diversity and inclusion, respect and integrity, customer focus, and innovation. For more
information, visit Follow us on Twitter, Facebook, LinkedIn, and

Job Segment: Curriculum, Social Media, Education, Publishing, Marketing
