Updated: Oct 4, 2019
Ryan Price has experience in academia across several different theoretical & applied fields, such as statistics, economics, and finance. His professional business work spans many software development and data-related fields, mostly focused on data engineering and ETL (Extract Transform Load). His preferred toolset contains the likes of R, Python, SQL, Bash, and Docker, with a strong focus on developing reproducible & portable enterprise data software.
Note that people asked questions about both ‘data science’ and ‘data engineering’. The difference is usually perceived as follows: DS is much broader, and DE is a part of DS. But the two aren’t mutually exclusive; it isn’t true that ‘DE doesn’t do DS work’. It’s all very nebulous.
QUESTIONS ON DATA SCIENCE & ENGINEERING
Abhinav, Smahi & Jaskirat — Explain data science (DS)? What is its scope?
There is no consensus on what ‘data science’ is. There are internet flame wars about what people think it is, and is not.
What’s universally true, is that DS covers a ton of skill areas, and is one of the most interdisciplinary fields of work any of us may see in our lifetimes.
Because of that, DS is much less its own field, and much more a skillset that is applied to any field, like business, biology, chemistry, finance, healthcare, etc.
Rather than say what it is and isn't, I'd rather point out what it can be based on what I've seen, heard, and worked with.
‘Data science’ can be:
Making predictions with data
Hypothesizing about data (and what that data were drawn from)
And testing those hypotheses!
Telling a story with data
Moving data from place to place
Transforming, aggregating, summarizing data
Tons of other ‘data’ things. Basically, if it’s work that touches tabular data, you can bet someone in the field has called that work ‘data science’
I will say that there’s an adage that goes: “90% of data science is data preparation”. Another common one that someone added to that is: “90% of data prep is software development”.
So, maths: 81% of ‘data science’ is software dev.
You can get started in such work without deeper software dev experience, but having such experience will make you infinitely more valuable and efficient.
Tanmay — How are Machine Learning and data science related?
ML is a part of the vague ecosystem of DS. ML is the act of applying statistical methodologies to have a program do things without being explicitly programmed to do those things. Statistics is the foundation that allows for these things to be done.
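A minimal sketch of that idea, using only the Python standard library: the program is never told the rule `y = 2x + 1`; it estimates a slope and intercept from example data with ordinary least squares, then predicts for an input it has not seen. The data and the `fit_line` helper are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of cross-deviations) / (sum of squared x-deviations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that happens to follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 10.0 + intercept  # predict for an unseen input
print(slope, intercept, prediction)
```

Real ML work adds noise, many more variables, and model validation, but the principle is the same: the rule comes from the data, not from the programmer.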
Suraj — Is data mining a part of data science?
Yes. DM is broadly defined as “discovering patterns in large datasets.” More statistical stuff.
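As a toy illustration of "discovering patterns", the sketch below counts which pairs of items appear together most often across a handful of shopping baskets. The baskets are hypothetical, and real data mining works on far larger datasets with more careful statistics.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for this example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```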
Keshav & Rushab — Can working as a data engineer further our chances to work in the field of AI?
I’d argue that it’s necessary work to get into AI.
Shivam — What are the things that data engineers work on?
I’m including DS here
Machine learning / AI
Predictive modeling / forecasting
ETL (extract, transform, load)
Data processing, cleaning, etc.
Write a lot of code, in several languages
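To make the ETL item above concrete, here is a minimal sketch using only the Python standard library. It assumes a tiny CSV source (an in-memory string standing in for a file) and an in-memory SQLite database as the target; the `payments` table and its columns are hypothetical.

```python
import csv
import io
import sqlite3

# Extract: read rows from a CSV source (a string stands in for a real file).
raw_csv = "name,amount\nalice, 10\nbob,not_a_number\ncarol,5\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: strip whitespace, convert types, and drop rows that fail to parse.
clean = []
for row in rows:
    try:
        clean.append((row["name"].strip(), int(row["amount"].strip())))
    except ValueError:
        continue  # a real pipeline would log or quarantine bad rows

# Load: write the cleaned rows into a database table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", clean)
conn.commit()

total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

Most of the code here is cleaning and error handling rather than analysis, which is typical of this kind of work.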
Shivam — Average package offered to data engineer?
There’s no way to definitively answer that. “Comfortably compensated” is probably the best answer I can give. It’s an educated tech job.
But, don’t pursue a career path because of the potential compensation!
Rajnish — What kind of jobs can a data engineer get?
You can be a… data engineer.
But, just like any other field, your career path is not strictly defined by your degree (in the US, anyway).
Naman — Data science looks boring as compared to fields like web development, computer vision, game development. Is it different from what I perceive it to be?
Almost everyone I know would agree that “computer vision” is “data science”.
Depending on what web development you're doing, it might be considered “data science”; DS requires a lot more web development than you might realize.
I write server-side HTTP applications (using Flask) all the time, to facilitate data movement, transformation, etc.
Game dev that implements actual, statistical AI is an application of DS.
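A small data-transforming HTTP service of the kind mentioned above can be sketched without any third-party dependency. Note the swap: this uses a plain WSGI callable instead of Flask, so it is not the author's actual code. The `app` and `call_app` names, the JSON payload shape, and the summary fields are all assumptions made for illustration.

```python
import io
import json


def app(environ, start_response):
    """WSGI app: accept a JSON list of numbers, return summary statistics."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length) or b"[]")
    summary = {"count": len(payload), "total": sum(payload)}
    body = json.dumps(summary).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(numbers):
    """Invoke the app in-process with a fake request (no network needed)."""
    data = json.dumps(numbers).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(data)),
        "wsgi.input": io.BytesIO(data),
    }
    status_holder = []
    chunks = app(environ, lambda status, headers: status_holder.append(status))
    return status_holder[0], json.loads(b"".join(chunks))


print(call_app([1, 2, 3]))
```

To serve it over real HTTP, the standard library's `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should work for local experiments. A framework like Flask wraps the same request/response cycle in a friendlier interface.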
Naman — Do you need to have a strong mathematical background?
To actually do valuable work, yes, you do. Getting into ‘data science’ is very easy these days. Anyone can follow an online tutorial and copy-paste code, work with already-clean data, and get something out of a model. To actually work and think independently in ‘data science’, you will need a strong math and statistics background, problem-solving skills, experience in software development, etc.
Note that there are many people on the internet that will disagree with me on that, especially the math/stats needed. They are welcome to hold their own opinions, no matter how wrong they may be.
Nisha — Would you suggest learning pure statistics?
Yes, absolutely. One of the most important fields in any career, especially fields like these. Stats is roughly defined as the scientific toolset for learning more about the whole, from the part. Take a sample, and infer things about the population. Look at historical data, and predict future data. It’s all the same principle.
This is a short answer, but one that I am most passionate about.
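The "learn about the whole from the part" principle can be sketched in a few lines. The example below assumes a simulated population (so the true answer is known and can be compared), draws a random sample, and estimates the population mean with an approximate 95% confidence interval. All of the numbers are invented for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "population": 100,000 values we pretend we cannot fully observe.
population = [random.gauss(170.0, 10.0) for _ in range(100_000)]

# The part: a random sample of 400 values.
sample = random.sample(population, 400)

sample_mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(n).
std_error = statistics.stdev(sample) / len(sample) ** 0.5
# Approximate 95% confidence interval using the normal critical value 1.96.
low, high = sample_mean - 1.96 * std_error, sample_mean + 1.96 * std_error

print(f"sample mean: {sample_mean:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
print(f"true population mean: {statistics.mean(population):.2f}")
```

An interval built this way will miss the true mean roughly 1 time in 20, which is exactly the kind of uncertainty statistics teaches you to quantify.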
Nisha — PhD or MS for data science?
Don’t pursue a degree just because of the name on it; explore what you want to do (content-wise), talk to people, get guidance or clarification, and decide from there.
DS skills are not standalone; you apply them to other fields of study / work. So, figure out what you’re passionate about, take courses in that, and load up on DS supporting material along the way.
Sushma — Does a master’s degree in data science limit your opportunities as compared to master’s in Computer Science?
I would say yes, simply for the reason that ‘data science’ is a nebulous term at best, and many unis have jumped on the bandwagon to offer ‘data science’ degrees that are difficult to vouch for (see above).
CS is pretty time-tested. Note though, that a pure CS degree might not prepare you as well for ‘DS-type’ work as stats, etc. might; but a CS degree is incredibly valuable, for sure, and most importantly it’s much more ‘future-proof’.
Pankaj — Is it necessary to have experience in data science for pursuing a master’s degree in the field?
Not at all. Many quantitative fields have entry requirements that don’t depend on your previous degree (though many do, e.g. an Economics Ph.D.).
Sushma & Tanmay — Important math courses that a bachelor’s student should take for a master’s in data science?
Maths: several calculus courses (enough to cover limits, derivatives and their ‘rules’, integrals), linear (matrix) algebra, real analysis, etc.
Stats: probability, distribution theory, regression models, estimation vs. inference, hypothesis testing, critical thinking (some unis have this as a separate stats course), etc.
Naman & Prakash — What are the top languages for data engineering?
First, distinguish between a language and its implementation.
e.g. SQL is not a language
Many people I’ve met think that just knowing how to write a loop in Python is enough to read through a TensorFlow tutorial, or even to hold a paying job doing such things. So, the user-facing language is not as important as the fundamental knowledge: stats, database theory, programming principles, etc.
That being said, the most common language interfaces you should understand well for broad success (today):
Though more rare, you may also do a lot of work in:
Java (or another JVM language, like Scala or Clojure)
Common software or platforms you should be comfortable with include:
Git, and GitHub/Bitbucket/another hosting platform
Cloud hosting providers (AWS, GCP, Azure, etc)
To be especially successful in the space, you always need to be thinking about how such work can be productionalized. Anyone can follow a tutorial on how to do some simple stuff on their own computer, but there’s not as much business value in that. Think automation, robustness, etc.
Suraj — Why is Python used for data science?
Python is a general-purpose language that’s very high-level (i.e. abstract), and as such it’s very easy to learn, read, and write. But, Python’s default implementation (CPython) is very slow for intense operations like numerical computing. CPython allows for writing separate (fast) C/C++ code, and wrapping Python code around the exposed functionality. This is how libraries like NumPy, pandas, etc. work; almost all the actual data-processing code isn’t written in Python at all.
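A rough way to see this for yourself, using only the standard library: the sketch below adds up a million integers twice, once with a loop written in Python and once with the built-in `sum()`, whose loop runs in compiled C inside CPython. The `python_loop_sum` helper is made up for illustration, and the timings will vary by machine and Python version.

```python
import timeit

values = list(range(1_000_000))


def python_loop_sum(numbers):
    """Add the numbers one at a time in interpreted Python bytecode."""
    total = 0
    for n in numbers:
        total += n
    return total


# Both approaches must agree on the answer before timing means anything.
assert python_loop_sum(values) == sum(values)

loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"Python loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
```

The built-in version is usually noticeably faster. Numerical libraries push the same trick much further by keeping whole arrays and their operations in compiled code.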
There is a misconception that libraries like TensorFlow are ‘Python’ libraries, and that their existence makes Python somehow an objectively better tool for ML. In reality, TensorFlow (and friends) are all written in C/C++, and the Python library for them just operates well with what is exposed to the Python caller. R, Julia, etc. all have C/C++ callers, and can plug in to TensorFlow, etc. just as easily. The top-level library for Python just happens to be the most mature right now.
Also: “why R?” Because R’s implementations are designed from the ground up to work with tabular data. I personally only use other tabular-data libraries (e.g. pandas in Python) as a last resort, or for super lightweight work. R is always my first go-to.
Suraj — I want to have a career in data analytics but I am a back-end developer. How should I go about achieving my goal?
Don’t think like that. You’re setting yourself up for failure, in any field.
But: if you already do backend development work, you’re already well on your way! Depending on what that backend work is, I guess.
At the end of this, there will be resources I list/link that will help you get moving towards that goal.
Jatin — Since the number of software engineers is growing exponentially, what can we do to stand out?
The only thing that I might suggest universally, is to have some public projects on a GitHub, etc. page. Not copy-pasted project code, or a tutorial walkthrough, but an actual project that you worked on yourself. Something you built, a tool that solves a unique problem (no matter how small), etc.
I don’t have any hard numbers in front of me, but I have heard similar things. I have also heard that while the number of software devs is growing (in all fields), the demand for them is growing faster.
So, my advice is to just do your best, and in this current labor climate, it should all work out for you.
Sorry that’s a boring answer, but I believe it to be true.
Tanmay — What steps can I take closer to summer to work on my data skills?
Akash, Rushab & Amarjeet — Where should a person who does not have any prior experience in data science start?
These should all be free resources
First and foremost, if you want to get more into machine learning-esque work, you’ll need a strong background in statistics and maths. The best self-paced, free resource for both of those is Khan Academy.
If you want deeper stats material in that direction, eventually pick up An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani.
Even deeper? The Elements of Statistical Learning, by Hastie, Tibshirani, and Friedman.
For understanding where the data you’ll use is actually stored (probably a “data warehouse”) and how it’s structured, read the first few chapters of The Data Warehouse Toolkit by Ralph Kimball.
For getting into applied ‘data science’ stuff (and programming), read R for Data Science by Hadley Wickham and Garrett Grolemund. It’s obviously focused on R, but you’ll take a lot of principles from walking through the book and its examples that are broadly applicable.
For the more CS-y side of DS, read through Data Structures and Algorithms by Granville Barnett and Luca Del Tongo. I’ve not read it myself, but I hear very good things.
There is also Designing Data-Intensive Applications by Martin Kleppmann for some applied low-level material
If you already fancy yourself a strong R user, the book Advanced R by Hadley Wickham is a fantastic exploration of the internals of the R interpreter, but the knowledge is applicable to many other programming language implementations.
Finally, you can always reach out to me, or look through my GitHub or something. My handle everywhere (Gmail included) is ‘ryapric’.
QUESTIONS ON IMMIGRATION
Rajnish — What are the reasons that a student might want to do data specialization from outside India?
Better job opportunities, specific companies or areas that you might want to work for
Deepak, Jatin & Rajiv — How to decide, plan and apply for higher education after completing BTech?
Decide -- If you need a specialization in a field or want to switch fields
Plan -- start early so that you don’t waste time post graduation
Apply -- The applications need to be submitted before your 8th semester
Deepak & Rajiv — What are the requirements to apply abroad for higher studies?
Requirements are dependent on the country that you want to go to
US -- GPA, GRE (qualification criteria), TOEFL/IELTS (English proficiency)
Canada/Australia -- GPA, TOEFL/IELTS/PTE
Ansh — How to choose a field for a master’s degree when studying internationally?
This depends on what your interest is. For example, I did my bachelor’s in Mechanical Engineering and then did my master’s in engineering management. You can choose any field you’d want as long as you can prove basic qualifications. So, if you’d like to go into Computer Science or Engineering, you’d have to prove that you have some knowledge of programming. If you don’t, then you’d be asked to take a few courses as prerequisites.
Pankaj — Would getting a job before applying for a master’s affect my chances in any way?
Work experience in the field that you want to do your master’s in would increase your chances of getting a job after the degree. Admission to a degree program does not really depend on work experience. You can do a master’s degree right after your bachelor’s or a year or two later. For an MBA, you need to have at least 2 years of experience.
Deepak & Rahul — Does the CGPA matter? Is 80% considered good?
Yes, GPA matters. That’s the first criterion for selection. 80% is a good grade, but grades are not the only thing that would make or break the decision. It’s the first thing that universities look at, but you can improve your application with other things like GRE/TOEFL/IELTS scores, research, etc.
Ansh — Should I do MTECH or MBA?
It totally depends on you. If you want to pivot your career, then you should go for an MBA, but if you like what you are doing, then go for a master’s degree. Plus, a master’s does not require any experience, but an MBA does, and it’s more expensive.