What is data science?
Karen Tao, UX Researcher
December 26, 2019
Data scientist has been called “the sexiest job of the 21st century” (Davenport & Patil, 2012). But what exactly is data science? In 1996, data mining was described as a step in the process of discovering useful knowledge from massive data (Fayyad, Piatetsky-Shapiro, & Smyth, 1996). William S. Cleveland introduced the term data science in 2001, proposing to expand the technical areas of statistics to include models and methods for data, as well as computing with data (Cleveland, 2001). In the mid-2000s, social media sites such as Facebook and YouTube became popular, generating vast volumes of data as users interacted with one another by liking, sharing, and commenting. The amount of data became too large to handle with traditional technologies, and the field of big data developed as parallel and distributed computing technologies such as MapReduce, Hadoop, and Spark emerged. Research reported in 2013 found that 90% of all the data in the world had been generated in the preceding two years (SINTEF, 2013). With such large amounts of data available, machine learning and artificial intelligence models have come to overshadow other aspects of data science.
A typical data science project starts with defining a problem. For example, an online retailer may wish to recommend items to customers who have just made a purchase. The next step is data collection: the retailer may gather the purchase data of all its customers. The retailer is then ready to move on to data preparation, which includes checking for consistent data types and missing values. During data analysis, the retailer may perform feature selection to decide which variables will be used in model development. Modeling, often seen as the core activity of data science, comes next: the retailer may build a recommendation system by clustering customers with similar purchases. Finally, a summary of recommended products may be presented as visuals for executives, who can then decide what to stock as inventory.
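The preparation, modeling, and recommendation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular retailer works: the purchase records, customer names, and the function names `prepare`, `to_vectors`, `kmeans`, and `recommend` are all hypothetical, and plain k-means with hand-picked seed customers stands in for whatever clustering method a real project would choose.

```python
from collections import Counter

# Hypothetical toy purchase records: (customer, item). None marks a missing value.
RAW_PURCHASES = [
    ("ann", "tent"), ("ann", "stove"), ("ann", "lantern"),
    ("bob", "tent"), ("bob", "stove"),
    ("cat", "novel"), ("cat", "bookmark"),
    ("dan", "novel"), ("dan", "bookmark"), ("dan", "lamp"),
    ("eve", None), (None, "tent"),
]

def prepare(records):
    """Data preparation: keep only records whose fields are non-missing strings."""
    return [(c, i) for c, i in records if isinstance(c, str) and isinstance(i, str)]

def to_vectors(records):
    """Build one 0/1 feature vector per customer, with one column per item."""
    items = sorted({i for _, i in records})
    baskets = {}
    for customer, item in records:
        baskets.setdefault(customer, set()).add(item)
    vectors = {c: [1.0 if it in b else 0.0 for it in items] for c, b in baskets.items()}
    return items, baskets, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, seeds, iterations=10):
    """Plain k-means; centroids start at the vectors of the given seed customers."""
    centroids = [list(vectors[s]) for s in seeds]
    labels = {}
    for _ in range(iterations):
        # Assignment step: attach each customer to the nearest centroid.
        labels = {
            c: min(range(len(centroids)), key=lambda k: sq_dist(v, centroids[k]))
            for c, v in vectors.items()
        }
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [vectors[c] for c, lab in labels.items() if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def recommend(customer, baskets, labels):
    """Suggest items bought by same-cluster customers that this customer lacks."""
    counts = Counter()
    for other, lab in labels.items():
        if other != customer and lab == labels[customer]:
            counts.update(baskets[other] - baskets[customer])
    return [item for item, _ in counts.most_common()]

records = prepare(RAW_PURCHASES)
items, baskets, vectors = to_vectors(records)
labels = kmeans(vectors, seeds=["ann", "cat"])
print(recommend("bob", baskets, labels))  # ['lantern']
```

In this toy data, the two records with missing values are dropped during preparation, the campers and the readers fall into separate clusters, and each customer is offered what similar customers bought. Problem definition, data collection, and the final visualization for decision makers are left out of the sketch.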
In popular culture, data science is often treated as equivalent to machine learning or artificial intelligence. However, our short journey through history and our example of a data science project show that modeling is only one part of the picture. Clearly defining the problem, collecting data, preparing data, and visualizing results effectively are all crucial components of a data science project. Machine learning, in turn, is generally considered a subfield of artificial intelligence. It requires large amounts of data and valid statistical methods to produce an output, such as a prediction. As more data is provided, the performance of the model may improve, so that its predictions come closer to the actual results. Data science and machine learning are therefore closely related, but they are not the same practice.
At the UDRC, our researchers have used regression models to analyze academic outcomes for working college students, as well as logistic regression to determine the strongest predictors of experiencing intergenerational poverty. Stay tuned: in the near future we will take a closer look at the statistical methods used in our research reports, as well as our innovative data science approaches to future challenges.
REFERENCES
Cleveland, W. S. (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review, 69(1), 21-26.
Davenport, T. H., & Patil, D. J. (2012). Data scientist: The sexiest job of the 21st century. Harvard Business Review, 90(10), 70-76.
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54.
SINTEF. (2013, May 22). Big Data, for better or worse: 90% of world's data generated over last two years. ScienceDaily. Retrieved December 18, 2019 from www.sciencedaily.com/releases/2013/05/130522085217.htm