1. Introduction to Data Science

In this unit, students will obtain basic knowledge about Data Science. More specifically, students will learn what data, information and knowledge are, as well as the difference between them. Special mention will be made of the topic of the Data Science Life Cycle, as well as dealing with big data. Data governance and policies will also be discussed. Finally, students will learn about the Data Science fundamentals, also focusing on the career opportunities that Data Science offers.

Learning Objectives
In this unit, you will learn to:
> Define Data Science.
> Differentiate between data, information and knowledge.
> Recognize the differences between Data Science and Business Intelligence.
> Examine the convergence of Data Science and Artificial Intelligence.
> Identify the stages of the Data Science Life Cycle.
> Describe what Big Data is.
> Identify the characteristics of Big Data.
> Categorize Big Data technologies.
> Define what data governance is.
> Identify data governance principles.
> Discuss the skills and tools Data Science requires.
> Identify professions related to Data Science.
> Understand the importance of Data Science online communities.

Ministry of Education, 2024-1446

Python programming prerequisite
The Data Science and Engineering curricula in the pathways system require knowledge of Python programming basics. You can scan the QR code on the right to access introductory Python content. To find out what topics are available and for quick access to each unit, you can see pages 208-209.

1: Data, Information and Knowledge


Lesson 1 Data, Information and Knowledge
Link to digital lesson: www.ien.edu.sa

Data Science
The importance of Data Science lies in the fact that data has become an essential part of industry, because companies require data to function, grow and improve their businesses. Data helps companies in making proper decisions through data-driven approaches that analyze a large amount of data to derive meaningful insights.

Data Science application areas:
> Commercial and industrial applications.
> Healthcare, bioinformatics and natural sciences.
> Digital economy, social media and social networks analysis.
> Smart homes, smart cities, smart transportation.
> Education, e-learning, entertainment.
> Energy, sustainability and climate.

Data and Information
We are surrounded daily by data. We receive information from television, newspapers, books and the Web. But what is the difference between data and information?

Data is a representation of facts or ideas in a formalized manner capable of being communicated or manipulated by some process. For example, Figure 1.1 shows a collection of a student's personal data.

When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information. For example, Figure 1.2 provides organized information about a student. On this student card, you can see information such as the name, home address, telephone, email and date of birth.

Data Science: Data Science is the domain of study that deals with vast volumes of data using modern tools and techniques to find unseen patterns, derive meaningful information, and make business decisions.

Example: Estishraf, a National Information Center (NIC) online platform, applies advanced data science technologies to its database to generate valuable insights in more than 50 decision-making scenarios.

Data: The representation of facts or ideas in a suitable format for storage, processing, or transmission.

Information: A set of processed, organized, and structured data that provides context and enables decision making processes.

Figure 1.1: Unstructured data
Mohammad
14 Bader street
05********
mohammadsa.bl@outlook.com
16th April

Figure 1.2: Information
STUDENT CARD
Name: Mohammad
Home address: 14 Bader street
Telephone: 05*** *** **
Email: mohammadsa.bl@outlook.com
Date of birth: 16th April
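The step from Figure 1.1 to Figure 1.2 can be sketched in Python. This is only an illustration: the field names are taken from the student card, while the function name `to_student_card` and the assumption that the values always arrive in the same order are ours.

```python
# Raw, unlabeled values, as in Figure 1.1.
raw_data = [
    "Mohammad",
    "14 Bader street",
    "05********",
    "mohammadsa.bl@outlook.com",
    "16th April",
]

# The labels supply the context that the raw values lack (as in Figure 1.2).
# Assumption: values always arrive in this order.
FIELDS = ["Name", "Home address", "Telephone", "Email", "Date of birth"]


def to_student_card(values):
    """Pair each raw value with a field label to form a structured record."""
    if len(values) != len(FIELDS):
        raise ValueError("expected exactly one value per field")
    return dict(zip(FIELDS, values))


student_card = to_student_card(raw_data)
for field, value in student_card.items():
    print(f"{field}: {value}")
```

The values themselves do not change. Only the structure and the labels are added, and that is what turns the data into information.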


Raw Data and Information
Raw data is data that has just been collected from various sources and has not yet been processed for use. The term data usually refers to raw data. Once the data has been analyzed, it is considered information. Let's think about some examples:
> The number "8122001" is considered raw data because it is a value with no contextual meaning. Now, if this value is presented as "8/12/2001, your date of birth", then it is information, as it provides knowledge about a certain matter.
> Each student's test score is one piece of data. The average score of a class or of the entire school is information that can be derived from the data.

Information for Further Processing
Data or information from different sources can also be combined to create more powerful datasets. This process is called data blending. For example, you can combine information from the marketing and sales departments to understand which marketing campaigns were more successful and profitable for a group of products.

Table 1.1: Differences between data and information
Data
> Unstructured.
> Presented in the form of numbers, figures, or statistics.
> No dependencies.
> Derived from user or computer system inputs.
Information
> Has a logical structure.
> Presented through reports, graphs, or plots.
> Dependent on data.
> Derived from data processing.

Knowledge
Knowledge is our understanding of the world. In other words, it is the appropriate collection of information in a way that makes it useful. We can say that when a person understands some information about something, then they have knowledge about it. Information becomes knowledge when critical thinking, evaluation, structure, or organization is applied.

Let's look at the example in Figure 1.3. The data you can see at the bottom is a list of words that have no context. Now, if we organize this data, we can provide information. Let's suppose that this is a list of the sales of ice cream flavors from yesterday. A bit of analysis is useful to glean more information. For example, the most popular flavor of ice cream sold yesterday was chocolate. The knowledge is that the shop manager can see that chocolate is the most popular ice cream flavor. The next time he places an order, he will ask for five times as much chocolate ice cream as mocha ice cream.
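The ice cream example can be followed step by step in a few lines of Python. This is a minimal sketch: the sales list below is assumed to be the list of flavor names shown in the data layer of Figure 1.3, and the variable names are ours.

```python
from collections import Counter

# Data: a raw list of flavor names with no context
# (assumed to match the data layer of Figure 1.3).
sales = [
    "mocha", "chocolate", "vanilla", "vanilla", "chocolate",
    "chocolate", "strawberry", "vanilla", "chocolate", "chocolate",
]

# Information: the same data organized as sales per flavor.
sales_per_flavor = Counter(sales)

# Analysis: which flavor sold the most yesterday?
best_seller, best_count = sales_per_flavor.most_common(1)[0]

# Knowledge: how much more chocolate than mocha should be ordered?
ratio = sales_per_flavor["chocolate"] / sales_per_flavor["mocha"]

print(dict(sales_per_flavor))
print(f"Best seller: {best_seller} ({best_count} sales)")
print(f"Order {ratio:g} times as much chocolate as mocha")
```

The counting step produces information. Deciding how much to order is the part that uses knowledge, because it applies the information to a decision.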


Figure 1.3: The Data - Information - Knowledge pyramid
Knowledge: ORDER LIST (mocha 1 kg, chocolate 5 kg)
Information: ICE CREAM FLAVOR SALES (mocha, vanilla, chocolate, strawberry)
Data: mocha chocolate vanilla vanilla chocolate chocolate strawberry vanilla chocolate chocolate

Table 1.2: Differences between information and knowledge
Meaning
> Information: A refined form of processed data.
> Knowledge: Relevant information that leads to conclusions.
Predictability
> Information: Not sufficient to make predictions.
> Knowledge: Provides the ability to predict or make decisions.
Transfer
> Information: Can be transferred easily through verbal, written or electronic means.
> Knowledge: Requires learning of the subject.
Outcome
> Information: The outcome is understanding.
> Knowledge: The outcome is comprehension.
Objective
> Information: Answers the questions of who, when, what, and where.
> Knowledge: Answers the questions of how and why.


Data Science and Business Intelligence
Data is everywhere around us, and it is used, processed and analyzed in every field today. At the same time, data is constantly evolving and is used in several business applications, like Business Intelligence. Business Intelligence is a technology-driven process that analyzes data, providing important information that helps executives and managers make careful business decisions.

While both Data Science and Business Intelligence involve data, they are different from one another. Data Science is much more complex compared to Business Intelligence. The scope of Business Intelligence is limited to the business domain. In Business Intelligence, past data is analyzed by developing dashboards, creating business insights, organizing data and extracting information that would help the business to grow, with the final goal being the understanding of the current trends of the business. However, in Data Science, we use data to make future predictions and forecast the growth of the business, using a wide array of complex statistical algorithms and predictive models.

Additionally, Business Intelligence tools are limited to analyzing organizational information and setting up business strategies. On the other hand, the tools of a data scientist involve complex algorithmic models, data processing and even big data tools.

Business Intelligence: A data-driven system that incorporates data collection, data storage, data analysis, and data visualization to support decision making.

Table 1.3: Differences between Data Science and Business Intelligence
Scope
> Data Science: Data is used to make future forecasts for the development of the business.
> Business Intelligence: Past data is analyzed to understand the current trends of the business.
Tools
> Data Science: It includes complex algorithmic models, data processing, and even big data tools.
> Business Intelligence: The tools are limited to analyzing management information and overseeing business strategies.
Data types
> Data Science: It works with structured data, but mainly deals with unstructured and semi-structured data.
> Business Intelligence: It works with structured data that is typically data warehoused or stored in data silos.
Complexity
> Data Science: It has more complexity compared to business intelligence.
> Business Intelligence: It is much simpler compared to data science.
Flexibility
> Data Science: It is much more flexible as data sources can be added as required.
> Business Intelligence: It is less flexible as data sources must be pre-designed.
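The difference in scope can be illustrated with a small Python sketch. The monthly sales figures are invented for illustration, and the straight-line (least-squares) forecast is only one very simple predictive model chosen for this example; real Data Science projects typically use richer models.

```python
from statistics import mean

# Hypothetical monthly sales (units sold), invented for illustration only.
monthly_sales = [100, 110, 120, 130, 140, 150]


def summarize_past(sales):
    """Business Intelligence style: describe what has already happened."""
    return {
        "total": sum(sales),
        "average": mean(sales),
        "best_month": sales.index(max(sales)) + 1,  # months are numbered from 1
    }


def forecast_next(sales):
    """Data Science style: fit a least-squares line and extrapolate one month."""
    n = len(sales)
    months = list(range(1, n + 1))
    x_mean, y_mean = mean(months), mean(sales)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(months, sales))
        / sum((x - x_mean) ** 2 for x in months)
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n + 1)


print("Past performance:", summarize_past(monthly_sales))
print("Forecast for next month:", forecast_next(monthly_sales))
```

`summarize_past` only looks backward, which is the typical focus of a dashboard. `forecast_next` uses the same data to estimate a value that has not been observed yet.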


Data Science and Artificial Intelligence
Data science has already been defined, and you are aware that Artificial Intelligence (AI) is another field that deals with massive amounts of data. These two technologies can be used independently to solve difficult challenges, and they can also converge and complement one another.

Data science processes historical data using computational tools to describe situations (descriptive analysis), predict results (predictive analysis), and provide recommended solutions to problems (prescriptive analysis). The most commonly used tools are statistical and management tools, which enable the analysis of historical data. On the other hand, AI employs a variety of techniques to mimic the way people think, decide, and solve problems. Rather than focusing on computation, the emphasis when working with AI tools is on knowledge and intelligence as critical elements for solving problems. Additionally, AI is concerned with cognitive computing. This distinction is less obvious in practice because sophisticated data science projects often include machine learning (an AI discipline) to facilitate data analysis in both prediction and prescription.

Data science and machine learning provide significant contributions to many organizations when used independently. However, traditional data analysis techniques are unsuitable when working with incomplete or inaccurate data, or when the business or scientific context changes so quickly that accurate data soon becomes obsolete. Similarly, machine learning technologies require a relatively significant amount of data. Therefore, the next generation of data science tools and business intelligence platforms use machine learning to conduct, for example, pattern recognition to discover hidden patterns and visualize crucial insights. In addition, machine learning and deep learning support data science with more accurate predictions. The availability of large datasets and the reduced cost of processing on the cloud empower machine learning with capabilities not possible in the past. When data science and AI are combined, they create synergies that provide significantly superior results and lead to better and faster decisions.

Artificial Intelligence (AI): A computer science field that focuses on building systems capable of performing tasks that usually require human intelligence, such as learning, reasoning, problem-solving, language and perception.

Example: Saudi Aramco has created a new "Corporate Digital Factory Department" supported by data scientists and machine learning experts who seek out operational challenges and develop intelligent solutions to help improve business performance. The company is actively promoting AI-inspired solutions to utilize billions of data points collected by geologists and petroleum engineers over the decades. As Aramco has always been an early adopter of AI technologies, data science and machine learning tools are used to improve the performance of reservoirs deep below the surface. Advanced AI techniques optimize field development plans and well trajectories, leading to cost reduction and improvement of the environmental impact. The company's geologists have deployed AI tools to study the data collected faster and more efficiently than ever. This process improves the understanding of the petro-physical properties of the terrain to be explored and drilled and enhances decision making.


Data Science Life Cycle
Through their experience working in data science projects, data scientists and data professionals follow specific steps to implement each new project successfully. This process, called the Data Science Life Cycle, has five distinct stages. This model has numerous variations that extend the stages to cover special projects, such as AI and machine learning projects, or to represent the internal processes of specific organizations.

Figure 1.4: The Data Science Life Cycle stages
1 Problem definition and formulation
2 Data collection
3 Data preparation and cleaning
4 Exploratory data analysis
5 Data visualization

1. Problem Definition and Formulation
In order to design and create a solution for a Data Science problem, we first need to understand what the problem itself is. A thorough analysis of the problem, its environment and the variables that affect it is crucial for developing the solution. The understanding that we have of a problem can greatly improve or hinder the development of its solution because it directly correlates with our approach to that solution. The next objective is to define the goal we want from that solution. A dataset always contains the same data, but the answers we want to derive can vary.

Problem definition and formulation: Understanding the objectives and requirements of a business or scientific problem and converting this knowledge into a data analysis problem.

Table 1.4: The most common types of data analysis
> Regression analysis: Get the quantities or qualities that exist in the dataset.
> Classification analysis: Organize the data into categories.
> Clustering analysis: Organize the data into groupings.
> Anomaly detection analysis: Find oddities or rarities in the data.
> Recommendation engines: Give an informed decision on a specific question.
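To make one of the analysis types in Table 1.4 more concrete, here is a minimal sketch of anomaly detection. The lesson does not prescribe a method, so the z-score rule, the threshold of 2.0, the function name `find_anomalies` and the sensor readings are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    away from the mean (a simple z-score rule, assumed for this sketch)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical temperature readings with one obviously odd value.
readings = [20.1, 19.8, 20.4, 20.0, 35.2, 19.9]
print(find_anomalies(readings))
```

The same dataset could answer a different question, such as a regression question about the trend of the readings. This is why the goal has to be defined before the analysis is chosen.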


2. Data Collection
After we have set our objectives, we need the dataset itself. Besides manual entry of data, the most common way to obtain it is data mining or data gathering. In this stage, enough data must be collected for further processing. The data itself can come from a variety of sources. Environmental sensors, mobile applications and web platforms continuously generate data. This data is automatically stored in databases.

Data Collection: The process of gathering and measuring data, including data acquisition, data labeling, and data improvement.

Table 1.5: The most common data storage formats
> Formatted files: JSON, XML, CSV, Spreadsheet XLS
> Relational Databases: Microsoft SQL Server, Oracle Database, Oracle MySQL
> Non-Relational (NoSQL) Databases: MongoDB, Azure Cosmos DB, AWS DynamoDB
> Graph Databases: Neo4j, AWS Neptune, Dgraph
> Time-Series Databases: InfluxDB, AWS Timescale

3. Data Preparation and Cleaning
Data cleaning, or data wrangling, is one of the most important stages in the Data Science Life Cycle. The data scientist must clean and prepare the data gathered in the data collection stage to ensure it is suitable for the subsequent analysis stage. When we combine multiple data sources, there are many chances for data to be duplicated or mixed up, and these issues will need to be fixed. If there are corrupted or incorrectly formatted data, duplicate or false data, or just incomplete data, the insights derived in the analysis stage will be false, and it will be very difficult to deduce whether the false insights originate from errors in the analysis steps or from uncleaned data. This is why taking the time and the effort to clean and validate the data thoroughly before analyzing it is highly important for the entire process.

Data Cleaning: The multistage process of reviewing and correcting data to ensure it is in a standardized format, including handling missing values, smoothing noisy data, and resolving inconsistencies and duplicates.
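The cleaning steps described above can be sketched with Python's standard `csv` module. The CSV text, the column names and the cleaning policy (dropping rows whose score is missing or invalid rather than repairing them) are assumptions made for this example. In a real project, the right policy depends on the problem.

```python
import csv
import io

# Hypothetical CSV export, invented for illustration. In practice the text
# would come from a file or a database collected in the previous stage.
RAW_CSV = """name,city,score
Ahmed,Riyadh,88
ahmed ,riyadh,88
Sara,Jeddah,
Omar,Dammam,abc
Noura,Riyadh,95
"""


def clean_rows(csv_text):
    """Return de-duplicated rows with standardized text and valid scores."""
    cleaned = []
    seen = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Standardize formatting: trim spaces and unify capitalization.
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        score_text = (row["score"] or "").strip()

        # Handle missing or corrupted values (policy here: drop the row).
        if not score_text.isdigit():
            continue

        # Resolve duplicates that differ only in formatting.
        key = (name, city)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append({"name": name, "city": city, "score": int(score_text)})
    return cleaned


for record in clean_rows(RAW_CSV):
    print(record)
```

Of the five raw rows, one is a duplicate, one has a missing score and one has a corrupted score, so only two records remain for the analysis stage.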


4. Exploratory Data Analysis
We have collected and thoroughly cleaned our data, and now it is time to analyze the dataset we have gathered and derive the desired answers to our questions. Data analysis is performed with data analysis tools or programming code and the relevant code libraries. It can start with a relatively simple analysis of one or more variables and expand to more sophisticated processes involving advanced statistics.

Nowadays, the most prominent method of analyzing a dataset is Machine Learning. To analyze data with Machine Learning, we need to follow specific steps. We first need to define the Machine Learning (ML) model. We do this by first specifying what the input and output values are. The next step is to construct the analysis algorithm itself. This is a complicated process, and specialist data scientists and machine learning engineers are sometimes used solely for this task. After the algorithm is completed, it is time to train and test the model. When the training and testing phases are completed, we can then use the production data and finally generate the answers we want.

Exploratory data analysis: The approach to analyzing datasets to summarize their main characteristics, often using visual methods.

5. Data Visualization
The analyzed data are usually tables of new data that are useful in the experienced eyes of data analysts. Working with a visual representation of the analysis helps to derive better insights. Graphs, plots and charts, or even maps, along with formatted reports, provide an efficient way to see and understand trends and patterns in data. When working with massive amounts of information, visualization of the results is essential to make data-driven decisions.

Data Visualization: A graphical representation of information that highlights patterns and trends in data and aids the reader in gaining quick insights.
Figure 1.5: COVID-19 outbreak analysis with SAS Visual Analytics. 2022 SAS Institute Inc.
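Stages 4 and 5 can be tried on a very small scale with the Python standard library alone. This sketch summarizes a variable (a simple exploratory analysis) and then draws a text bar chart (a very basic visualization). The test scores, the grade bands and the function names are invented for illustration; real projects usually rely on dedicated analysis and plotting tools.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical test scores, invented for illustration.
scores = [62, 75, 75, 81, 88, 90, 94]


def summarize(values):
    """Exploratory step: describe the main characteristics of one variable."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 1),
        "median": median(values),
    }


def grade_band(score):
    """Group a score into a band (band limits are an assumption)."""
    if score >= 90:
        return "90-100"
    if score >= 80:
        return "80-89"
    if score >= 70:
        return "70-79"
    return "below 70"


def text_bar_chart(counts, width=20):
    """Visualization step: one line per category, scaled to the largest count."""
    largest = max(counts.values())
    lines = []
    for label, count in counts.items():
        bar = "#" * round(width * count / largest)
        lines.append(f"{label:<12}{bar} {count}")
    return lines


summary = summarize(scores)
band_counts = Counter(grade_band(s) for s in scores)
chart = text_bar_chart(band_counts)

print(summary)
print("\n".join(chart))
```

The summary table is accurate but takes a moment to read. The bars make the distribution of scores visible at a glance, which is the point of the visualization stage.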


Exercises

1 Read the sentences and tick ✓ True or False.
1. Data Science is a multidisciplinary field that focuses on extracting meaningful information from data. (True / False)
2. When data is processed, organized, structured or presented in a given context so as to make it useful, it is called knowledge. (True / False)
3. Information is obtained from data analysis. (True / False)
4. Knowledge is the appropriate collection of data in a way that makes it useful. (True / False)
5. Graphs and charts provide information. (True / False)
6. Forecasts are considered knowledge. (True / False)
7. Data Science, Artificial Intelligence and Business Intelligence are three fields that coexist independently. (True / False)
8. Working with a visual representation of the analysis helps to derive better insights, so as to acquire better knowledge. (True / False)
9. Recommendation engines and Regression analysis are part of the data storage procedure. (True / False)
10. Time-Series Databases and Non-Relational (NoSQL) Databases are part of the data collection procedure. (True / False)


2 Create a list of data and then convert the data into meaningful information. How does a computer convert data into information?

3 Mention three basic differences between Data Science and Artificial Intelligence. Justify your answers providing examples.

4 Contrast and compare Data Science and Business Intelligence. If you owned a trading company, in which of the two fields would you invest?


5 How effective is the convergence of Data Science and Artificial Intelligence? Search the internet and find two successful examples.

6 Explain what Data Science is and identify three applications in everyday life for health, business and entertainment. Why is Data Science so important for these applications?

7 Compare and contrast sets of unprocessed and processed data that describe the annual grades and performance of a student. What insights can you get from datasets like this? Can you predict the academic performance of the student at university?


8 Find more information on Saudi Aramco's "Corporate Digital Factory Department" and identify three examples of the use of AI in data mining. What do you think about its effect on their operational practices?

9 Search the internet for Data Science life cycle models that describe the key stages mentioned in this lesson in more detail. Select one of them, identify the additional stages and briefly explain them.
