Scenario:
Each day we receive data from a collaborating hospital about patients' blood glucose levels. A patient has their level measured three times, and those readings are averaged to determine whether the patient's blood sugar level is normal, pre-diabetic, or diabetic (an average below 140 mg/dL (7.8 mmol/L) is normal, a reading between 140 and 199 mg/dL (7.8 mmol/L and 11.0 mmol/L) indicates prediabetes, and more than 200 mg/dL (11.1 mmol/L) after two hours indicates diabetes). Typically a file will contain all three readings for a patient, but occasionally the hospital's lab information system is out of sync and we will receive some readings for a patient at a later date.
The data we receive is in CSV format, and each file is named after the date it was transferred (2020-10-28 in the example attached). The files are uploaded each morning to the same directory in a shared S3 bucket. The files contain protected health information (PHI), which we are not allowed to store (PHI includes names, addresses, hospital identification numbers, etc., anything that could be used to personally identify the patient).
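The classification rule described above can be sketched as a small helper function (a sketch only: the stated ranges leave exactly 200 mg/dL unassigned, so treating it as diabetic here is an assumption, and the function name is hypothetical):

```python
def classify_glucose(avg_mg_dl):
    """Classify an average blood glucose reading in mg/dL.

    Thresholds follow the scenario description: below 140 is normal,
    140-199 is prediabetes, and above 200 is diabetes. Exactly 200 is
    unassigned in the description; we treat it as diabetes (assumption).
    A missing average means the status cannot be determined.
    """
    if avg_mg_dl is None:
        return "unable to be determined"
    if avg_mg_dl < 140:
        return "normal"
    if avg_mg_dl < 200:
        return "prediabetes"
    return "diabetes"
```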
Goal:
Design an ETL application to run each morning that ingests the new CSV file and persists the data in a database or the data file format of your choice. Assume that the volume of data will eventually grow to multiple TB and design your application accordingly.
Steps:
1. Where things are unclear, make assumptions and justify them with comments in the code
2. Write tests to ensure that your code and the data are correct
3. Remove protected health information (PHI)
4. Remove any invalid values and normalize where reasonable
5. Add a column that contains the average of all three glucose measurements (if present)
6. Add a column that indicates whether the patient's glucose levels are normal, prediabetes, diabetes, or unable to be determined
7. Account for late data (for example, if we receive two readings in one day's CSV file and the third reading in the next day's file)
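Steps 3 through 7 could be sketched with pandas as below. The column names (`patient_key`, `glucose_mg_dl`, and the PHI columns) are assumptions, since the CSV schema is not given, and the plausibility bounds for valid readings are hypothetical; at multi-TB scale the same logic would likely move to a distributed engine, but the transform is the same:

```python
from typing import Optional

import pandas as pd

# Hypothetical column names; the real CSV schema is not specified.
PHI_COLUMNS = ["name", "address", "hospital_id"]


def transform(df: pd.DataFrame, prior: Optional[pd.DataFrame] = None) -> pd.DataFrame:
    """Clean one day's readings and classify each patient.

    `prior` holds reading-level rows carried over from earlier files,
    which covers the late-data case (step 7).
    """
    # Step 3: drop PHI columns, keeping only a pseudonymous patient key.
    df = df.drop(columns=[c for c in PHI_COLUMNS if c in df.columns])

    # Step 4: coerce readings to numbers and drop missing or implausible
    # values (the 10-1000 mg/dL bounds are an assumption).
    df = df.assign(glucose_mg_dl=pd.to_numeric(df["glucose_mg_dl"], errors="coerce"))
    df = df[df["glucose_mg_dl"].between(10, 1000)]

    # Step 7: combine with readings received on earlier days, if any.
    if prior is not None:
        df = pd.concat([prior, df], ignore_index=True)

    # Steps 5-6: average the readings per patient and classify. Patients
    # with fewer than three readings cannot be determined yet.
    out = df.groupby("patient_key", as_index=False).agg(
        n_readings=("glucose_mg_dl", "size"),
        avg_glucose=("glucose_mg_dl", "mean"),
    )
    out["status"] = out.apply(
        lambda r: "unable to be determined" if r["n_readings"] < 3
        else "normal" if r["avg_glucose"] < 140
        else "prediabetes" if r["avg_glucose"] < 200
        else "diabetes",
        axis=1,
    )
    return out
```

Rows for patients still missing readings would be persisted alongside the output so the next day's run can pass them back in as `prior`.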
Hi,
I hope you are doing well.
I read your job post, and I believe I am a strong match for this opportunity, as I have 5+ years of relevant experience in the required skills.
Let's have a quick discussion.
Thanks
Virang Patel
Hi,
My team and I have read the information you provided, and we would be more than happy to work with you and deliver the quality of work and results you expect.
You won't regret choosing us for this project, given my team's proven track record.
Creativity and efficiency are the hallmarks of our team, and we have all the expertise to get your job done.
We have nothing but satisfied clients, and we're sure you will be one of them too. We also offer more reasonable pricing than the market.
Please contact us. Thank you.
Hello Hiring Manager,
I read your job description carefully, and I am very interested in your ETL pipeline job.
I have ample experience and have completed similar projects with positive client feedback.
Let me know if we can discuss further.
Regards
Himanshu
Hi,
I am an experienced Data Engineer with a solid background in Spark.
I have worked on many Big Data projects with Spark, Scala, Python, Cassandra, Snowflake, AWS,...
I suggest using Snowflake as a data warehouse, with an ELT approach. It can easily load data from S3.
Let's have a call for more details about the project and my solution.
Regards
Hello,
I am an experienced AWS developer and can do this task: extract data from CSV files in S3 buckets and store it in a database.
Let me know if you only want to store the data or also want to visualise it somewhere.
Thank you.
Dear client!
I've read your job description carefully.
I have more than five years of experience in Development.
Your satisfaction with the project is my top priority!
If you give me a chance to work with you, I will do my best to make your project a success.
I'm available in your time zone and can work full time on this project starting now.
Best Regards
I believe I can help you develop the ETL process to ingest the data into a database and apply the calculations that determine whether a patient is normal, prediabetic, or diabetic.
First, I will create a diagram to make the ETL strategy clear and to understand your data platform. Since the description mentions S3, I assume you use AWS; if so, I will need to learn the resources available in AWS to design the process, because my background is with Microsoft Azure.
Because of this, I will accept payment only when the project is finished, and I won't charge for the hours I spend studying AWS.