Please provide Python code in a Jupyter Notebook answering the three questions outlined below. The data for this task can be found at the link in the "Data Structure" section.
- It should be possible to run the notebook you send back without any local/external dependencies (you may use any standard, pip-installable Python package).
- Make sure the notebook is broken up into clear sections so it's easy to read and to know where each question is being answered. Tidiness and readability are highly valued!
- You are encouraged to leave in your thought process and any explorations you did to show how you went about answering the questions. In case of any doubts, please state your assumptions clearly.
We have provided a (fake) dataset simulating user sign-ups for a health plan, available at the link below:
Download Dataset (data_science_task_dataset.csv)
- Each row represents a sign-up, which can include multiple family members. The `ages`, `genders`, and `plans` fields hold one comma-separated value per individual, with the primary user listed first. For example, `PLUS, LITE, PLUS` in the `plans` field means the primary user chose Plus, and two family members chose Lite and Plus respectively.
- Some rows are for users who have only signed up, while others have paid. Paid users will have non-blank values for the `payment_time` field.
- If anything is unclear, please state your assumptions in a "Preamble" section in your notebook.
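As a rough illustration of the row structure described above, the sketch below splits the comma-separated fields into one row per individual and flags paid sign-ups. The sample rows are made up, and `signup_id`, `is_paid`, and `is_primary` are hypothetical names introduced here; only `ages`, `genders`, `plans`, and `payment_time` come from the description. It also assumes the three list fields always have the same number of entries within a row, which should be checked against the real data.

```python
import pandas as pd

# Tiny made-up sample; only the column names ages, genders, plans and
# payment_time come from the task description.
sample = pd.DataFrame({
    "signup_id": [1, 2],  # hypothetical identifier, may not exist in the CSV
    "ages": ["34, 36, 5", "52"],
    "genders": ["F, M, F", "M"],
    "plans": ["PLUS, LITE, PLUS", "LITE"],
    "payment_time": ["2023-01-05 10:00:00", None],
})

list_cols = ["ages", "genders", "plans"]

# Turn each comma-separated string into a list of trimmed values.
for col in list_cols:
    sample[col] = sample[col].str.split(",").apply(
        lambda values: [v.strip() for v in values]
    )

# Paid sign-ups are those with a non-blank payment_time.
sample["is_paid"] = sample["payment_time"].notna()

# One row per individual (multi-column explode needs pandas >= 1.3 and
# assumes the lists in a row all have the same length).
members = sample.explode(list_cols, ignore_index=True)
members["ages"] = members["ages"].astype(int)

# The first individual listed in each sign-up is the primary user.
members["is_primary"] = ~members.duplicated("signup_id")
```

Working at the individual level makes per-person breakdowns straightforward, while the original frame remains the right grain for sign-up-level questions such as conversion.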
Question 1: Data Exploration & Visualization
After exploring the data (please show your work in an "Exploration" section), what do you think are **two crucial data breakdowns or plots** to be shown if you were presenting this data to the wider team? If multiple options are possible, please justify your choices.
Question 2: Time-to-Convert Analysis
Consider the `signup_time` and `payment_time` fields. In a **single, static plot**, how can you best show the distribution of time "deltas" between sign-up and payment (i.e., how long it takes for people to pay)? What is the best way to condense the relevant information and insights? The plot should ideally not need to be magnified to be understood.
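The deltas themselves can be computed along the following lines. This is a minimal sketch on made-up timestamps: the timestamp format is an assumption, and the bucket edges and labels are arbitrary illustrative choices, not a suggested answer. Time-to-pay data is often heavily right-skewed, so coarse or log-like buckets are one way to keep a static plot readable.

```python
import numpy as np
import pandas as pd

# Made-up timestamps; the actual format in the CSV is an assumption.
df = pd.DataFrame({
    "signup_time": [
        "2023-01-01 09:00:00",
        "2023-01-01 12:00:00",
        "2023-01-02 08:00:00",
        "2023-01-03 08:00:00",
    ],
    "payment_time": [
        "2023-01-01 09:30:00",
        "2023-01-03 12:00:00",
        None,  # signed up but never paid
        "2023-01-13 08:00:00",
    ],
})
for col in ["signup_time", "payment_time"]:
    df[col] = pd.to_datetime(df[col])

# Only paid sign-ups have a delta.
paid = df.dropna(subset=["payment_time"])
delta_hours = (paid["payment_time"] - paid["signup_time"]).dt.total_seconds() / 3600

# Illustrative, roughly log-spaced buckets (hours); edges are arbitrary.
edges = [0, 1, 24, 24 * 7, np.inf]
labels = ["<1h", "1h-1d", "1d-1w", ">1w"]
binned = pd.cut(delta_hours, bins=edges, labels=labels, right=False)

# Counts per bucket, in bucket order, ready to feed into a bar chart.
counts = binned.value_counts().sort_index()
median_hours = delta_hours.median()
```

A bar chart of `counts`, annotated with `median_hours`, is one compact option; a histogram on a logarithmic x-axis is another.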
Question 3: Modeling Strategy
You are given the payment amounts but don't know the underlying price function or its inputs. If you had to treat this as a prediction problem, what kind of model would you use? **Please do not actually build a model.** Base your answer on your data exploration. We are looking for a discussion of potential modeling challenges and how you would pick a model to overcome them.