Description:
You will be responsible for designing, developing, and maintaining scalable data pipelines to support data-driven initiatives. The role requires a strong understanding of data architecture and database management, along with the programming skills needed to handle large volumes of data.
Job Responsibilities:
- Writing, maintaining, and debugging web crawlers and scrapers to extract data at scale
- Using APIs to fetch data and storing it in databases (SQL or NoSQL)
- Parsing data extracted from various sources and performing data cleaning and transformation
- Implementing data quality and data validation processes to ensure the accuracy, consistency, and reliability of data
- Processing extracted data to derive business insights and acting on them
- Developing and maintaining robust, scalable, and high-performance data pipelines and ETL processes
- Designing, building, and optimizing data models, databases, and data warehouses for the storage and retrieval of structured and unstructured data
- Maintaining repositories with version control tools such as Git and deploying programs on servers
Job Qualifications:
- No degree required; skills are assessed through our recruitment process
- 0-2 years of experience in Python. Preference will be given to candidates with experience in data engineering technologies, database management, web automation, and web scraping
- NOTE: Recent graduates are welcome to apply but MUST showcase exceptional academic, freelance, or personal projects
- Experience with data transformation and data warehousing tools such as dbt and Snowflake
- Familiarity with different databases (such as MySQL, MongoDB, and PostgreSQL) and experience reading and writing complex SQL queries
- Experience with Python scraping libraries and frameworks (such as Selenium, BS4, and Scrapy) and the Requests module
- Knowledge of API integration to implement complex workflow automations
- Experience with CI/CD and version control tools such as Git and GitHub
- Familiarity with UNIX and shell scripting
- Bonus: Knowledge of AWS (EC2, RDS, Glue, EMR, Lambda, S3, etc.), Azure (ADF, Databricks, etc.), and GCP (BigQuery, Compute Engine, Functions, etc.)