Data Science Fundamentals - Data Science - Second Year of Secondary School
1. Introduction to Data Science
2. Data Collection and Validation
3. Exploratory Data Analysis
4. Predictive data modeling and forecasting
Lesson 3 Data Science Fundamentals

Mathematics Needed to Become a Data Scientist

Designing data science algorithms, implementing analyses, and discovering insights from data all require mathematical knowledge. While mathematics isn't the only tool a data scientist needs, it is one of the most significant. One of the most critical steps in a data science project workflow is identifying and understanding business challenges and turning them into mathematical ones.

Linear Algebra

Linear algebra is concerned with matrix and vector operations. This is very important because in data science models and algorithms, all the numbers and information are converted into matrices. Linear algebra is also the basis of dimensionality reduction, which is necessary for processing large datasets. Computer vision and natural language processing (NLP) are also data science fields that rely heavily on linear algebra.

Discrete Mathematics

Discrete mathematics specializes in logic and deduction methods, which are essential to algorithm design and form the basis for data science. Another very important field of discrete mathematics is graph theory. Graphs are used to model very complex networks, such as gene regulatory networks. Their study in data science is valuable for the advancement of fields such as precision medicine, systems biology and many more.

Probability and Statistics

When an analysis generates data, a data scientist needs practical statistical and probabilistic knowledge to understand and interpret that data. Measures such as variance, correlation and standard deviation are used extensively by data scientists to gain insight into the underlying relationships between the features of a dataset.
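As a minimal sketch of the measures named above, the following Python example stores a small dataset as a matrix (a list of rows) and computes the variance, standard deviation and correlation of its columns. The data values and the names `dataset`, `column` and `pearson_correlation` are assumptions made up for illustration; only the Python standard library is used.

```python
import math
import statistics

# Hypothetical dataset: each row is one student, and the columns are
# (hours studied, exam score). The values are made up for illustration.
dataset = [
    [1.0, 52.0],
    [2.0, 58.0],
    [3.0, 65.0],
    [4.0, 71.0],
    [5.0, 79.0],
]

def column(matrix, index):
    """Return one column of a matrix stored as a list of rows."""
    return [row[index] for row in matrix]

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

hours = column(dataset, 0)
scores = column(dataset, 1)

hours_variance = statistics.pvariance(hours)  # population variance
hours_std_dev = statistics.pstdev(hours)      # population standard deviation
hours_scores_corr = pearson_correlation(hours, scores)

print("variance of hours:", hours_variance)
print("standard deviation of hours:", round(hours_std_dev, 3))
print("correlation(hours, scores):", round(hours_scores_corr, 3))
```

A correlation close to 1 suggests that the two features rise together, which is the kind of underlying relationship between features that these measures help reveal.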
Calculus

Visualizing the results of a data analysis is critical for providing insightful information through plots and graphs. Calculus is an integral part of the algorithms used for the complex arithmetic operations required in this process. Concepts and techniques such as partial derivatives, linear regression and gradient descent are used extensively in optimization and loss calculation.

Ministry of Education, 2024-1446
Link to digital lesson: www.ien.edu.sa
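To make the link between partial derivatives, linear regression and gradient descent in the Calculus section more concrete, here is a minimal sketch in Python. It fits a line y = w*x + b to a few points by repeatedly stepping against the partial derivatives of the mean squared error loss. The data points, the learning rate, the number of steps and the function names `fit_line` and `mean_squared_error` are assumptions chosen for illustration, not values from this lesson.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b with gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error for every point with the current w and b.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the loss (1/n) * sum(error ** 2)
        # with respect to w and with respect to b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Take a small step in the direction that lowers the loss.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mean_squared_error(xs, ys, w, b):
    """Average squared distance between predictions and true values."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical points that lie exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(xs, ys)
final_loss = mean_squared_error(xs, ys, w, b)
print("slope:", round(w, 3), "intercept:", round(b, 3))
print("loss:", final_loss)
```

Because the points were chosen to lie on a known line, the fitted slope and intercept can be checked by eye. With real data the loss would normally settle at a small positive value instead of approaching zero.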
Python for Data Science

Data science professionals generally prefer Python for their projects. It is a high-level, object-oriented programming language with a gentle learning curve. It is easy to begin working on a project, because you can start by writing simple structured code or design and implement a solution using Object-Oriented Programming (OOP) principles. Application Programming Interfaces (APIs) and library modules provide access to powerful functionality that is easy to use. Numerous Python libraries are used by professionals in various enterprises and cover a wide variety of needs: data mining, data preparation and analysis, data processing, predictive modeling, data visualization and reporting. Beyond traditional data science applications, Python libraries also support machine learning and advanced artificial intelligence requirements.

Python: A high-level, general-purpose programming language that has gained increasing popularity in data science and machine learning.

Introduction to Jupyter Notebook

Python scripts can be written in an Integrated Development Environment (IDE) such as Visual Studio Code or JetBrains PyCharm, or in Jupyter Notebook. Jupyter Notebook is an open-source web application used to develop and present data science projects with Python. Its interactive environment enables data scientists to create "notebooks". A notebook integrates Python code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media.

Figure 1.9: Jupyter Notebook architecture

After Jupyter is installed, it runs in a web browser, either online or on a personal computer. Besides Python, Jupyter Notebook supports over 100 programming languages (called "kernels" in the Jupyter ecosystem), including Java, R, Julia, MATLAB, Octave, Scheme, Processing, Scala, and many more.
Out of the box, Jupyter will only run the IPython kernel, but additional kernels may be installed. We will use Jupyter Notebook for Exploratory Data Analysis later in this book. The latest web-based application for Jupyter is JupyterLab, and all notebook documents work the same in both environments.

Figure 1.9 diagram labels: User, Web browser, Notebook server, IPython

Figure 1.10: Jupyter Notebook sample screenshot (an example of visualizing random data in a notebook, created with JupyterLab)
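The idea that a notebook keeps narrative text, code and output together in one document can be sketched with plain Python. Notebook files (`.ipynb`) are stored as JSON, and the example below builds a small dictionary shaped like such a file, with one text cell and one code cell that carries its saved output. The cell contents and the function name `build_notebook` are made up for illustration, and the field layout follows the version 4 notebook format as an assumption, so treat this as a simplified sketch and not as a complete specification.

```python
import json

def build_notebook():
    """Return a dictionary shaped like a minimal version 4 notebook file."""
    text_cell = {
        "cell_type": "markdown",
        "metadata": {},
        "source": ["# Average of a small list"],
    }
    code_cell = {
        "cell_type": "code",
        "execution_count": 1,
        "metadata": {},
        "source": ["values = [2, 4, 6]\n", "print(sum(values) / len(values))"],
        # The output of the code is saved next to the code itself.
        "outputs": [
            {"output_type": "stream", "name": "stdout", "text": ["4.0\n"]}
        ],
    }
    return {
        "cells": [text_cell, code_cell],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }

notebook = build_notebook()
notebook_json = json.dumps(notebook, indent=1)

# List what kind of content each cell holds.
for cell in notebook["cells"]:
    print(cell["cell_type"], "->", "".join(cell["source"]))
```

Text like `notebook_json` is roughly what is written to disk when a notebook is saved, which is why the code, the explanation and the results travel together in one file.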
Tools for Data Science

Data science is a complex process that requires many steps to create a data science solution. For each step of the process, numerous tools exist for accomplishing the desired task. Table 1.9 presents the most popular tools for each data science step.

Figure 1.11: IBM Cloud Pak for Data sample screenshot (data quality analysis of the BANK_CUSTOMERS data asset)

Table 1.9: Popular tools for data science
Step | Purpose | Software tools
Data Storage | The databases where the data is stored | MySQL, SQL Server, MongoDB, Neo4j
Data Transformation | Tools that query the data that we want to analyze | Python, SQL, Apache TinkerPop
Modeling | Converting the queried data into models that are appropriate for analysis | Pandas, NumPy, Apache Spark
Analysis | The process that generates the desired insights | TensorFlow, PyTorch, IBM Watson, AWS SageMaker
Visualization | Visualizing the results in the optimal format | Matplotlib, D3.js, R
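The steps in Table 1.9 can be sketched end to end in a few lines of Python. This example uses the standard-library `sqlite3` module as a stand-in for the database servers listed in the table, and plain Python in place of Pandas, a machine learning framework or a plotting library, so that it runs without installing anything. The table name `sales`, its columns and its values are all hypothetical.

```python
import sqlite3
import statistics

# 1. Data storage: an in-memory SQLite database stands in for a
#    database server such as MySQL.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (region TEXT, amount REAL)")
connection.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0), ("south", 100.0)],
)

# 2. Data transformation: SQL queries only the data we want to analyze.
rows = connection.execute(
    "SELECT region, amount FROM sales ORDER BY region"
).fetchall()
connection.close()

# 3. Modeling: reshape the queried rows into a structure suited to analysis.
amounts_by_region = {}
for region, amount in rows:
    amounts_by_region.setdefault(region, []).append(amount)

# 4. Analysis: compute an insight, here the average sale per region.
average_by_region = {
    region: statistics.fmean(amounts)
    for region, amounts in amounts_by_region.items()
}

# 5. Visualization: a simple text bar chart instead of a plotting library.
for region, average in average_by_region.items():
    print(f"{region:<6} {'#' * int(average // 10)} {average:.1f}")
```

In a real project each numbered step would typically be handled by one of the dedicated tools in the table. The point of the sketch is only the order in which the data flows through the steps.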
Data Science jobs

Data Science is one of the fastest growing and most in-demand computer-related fields today. The Misk Foundation has published a Saudi Job Market report focusing on current in-demand job roles, and Data Science career opportunities look particularly promising, especially for careers that support the goals of Saudi Vision 2030.

Table 1.10: Professions related to Data Science
Profession | Description
Data Scientist | Their job is to find, process and analyze data for companies and organizations. They take raw and unprocessed data and extract insights and patterns from the data that help companies and organizations analyze their performance and make mission-critical decisions.
Machine Learning Engineer | They are responsible for implementing Machine Learning (ML) solutions and systems for the appropriate applications. They need to be knowledgeable in software engineering and statistics in order to be able to test their solutions and judge the correctness of the produced ML models.
Machine Learning Specialist | While ML engineers are concerned with the application of ML models, ML specialists focus on the mathematics of the specific algorithms that produce the models that engineers are then able to utilize.
Applications Architect | They design the information systems for organizations and companies.
Enterprise Architect | They combine business and technical knowledge, and they are in constant communication with stakeholders and technical departments. They are tasked with translating business and organization data needs into technological specifications and solutions, which they forward to the technical teams.
Data Architect | They are responsible for the storage and flow of information in a company or organization. They work with data scientists and engineers to build the appropriate data pipelines for dataset input, analysis and results output.
Data Engineer | Data engineers assist data architects in building the digital framework for data capture, storage and processing, which both data scientists and analysts will use for their work.
Infrastructure Architect | Their role is to manage the infrastructure where data is stored and processed. They need to take into consideration factors such as data privacy, protection and infrastructure performance on the servers where the data analysis takes place. Data science projects are continuously becoming more complex, so infrastructure architects need to make sure that the data processing is completed within the appropriate timelines.
Data Analyst | Data analysts are the professionals who take the insights from the processed datasets and generate reports, visualizations and various other analytics that are aligned with the original objectives of the data science project.
Data Science Online Communities

Data Science methodologies and technologies are always changing, so data scientists want to stay in touch with peers in their own field and in similar professions to learn new ideas and approaches. Online resources help data scientists keep pace with these changes. The need for a community of Data Science experts to support this work has sparked a variety of online forums and groups. By participating in Data Science online communities, data scientists can connect with each other and help the field evolve efficiently. The most prominent communities are mentioned below, but this is an area where new communities may emerge and become successful.

Kaggle
Kaggle, a Google subsidiary, is the largest data science community, with millions of active members and a wide range of resources. Data scientists can find public datasets, educational resources and cloud-based workbenches to support their data analysis work.
https://www.kaggle.com/

IBM Data Community
IBM Data Community is an online forum with blogs dedicated to data science. It hosts research papers, webcasts and presentations that are updated as the field evolves.
https://community.ibm.com/community/user/home

There are more online communities, some supported by governments and some run by volunteers. Some focus more on the community side, with face-to-face meetings, while others focus on the code required for data science projects.

Figure 1.12: Kaggle.com home page

Table 1.11: Online communities
Community | Link
Data Science Central | https://www.datasciencecentral.com/
Stack Exchange | https://datascience.stackexchange.com/
Data Science Society | https://dssberkeley.com/
Driven Data | https://www.drivendata.org/
Data Community DC | https://www.datacommunitydc.org/
Reddit | https://www.reddit.com/r/datascience/

Remember to always check the online reputation of the content contributor before using a dataset, code or tools. Check the permissions of use for each dataset, and try to download software tools directly from their developers' repositories.
Exercises

1 Read the sentences and tick ✓ True or False.
   1. In machine learning models and algorithms, all the numbers and information are converted into matrices.
   2. When the data from an analysis gets generated, a data scientist needs practical statistical and probabilistic knowledge to be able to understand and interpret that data.
   3. Discrete mathematics specializes in logic and deduction methods which are paramount aspects of algorithm design, which is the basis for machine learning.
   4. Some online communities are supported by governments and some run by volunteers.
   5. An Enterprise Architect is the person who designs the information systems for organizations and companies.
   6. A Data Scientist is a professional that takes the insights from the processed datasets and generates reports, visualizations and various other analytics that are aligned with the original objectives of the data science project.
   7. A Data Analyst is a professional who is responsible for the storage and flow of information in a company or organization. He works with data scientists and engineers to build the appropriate data pipelines for dataset input, analysis and results output.
2 Explain how Python can help a Data Science professional.

3 Explain how Jupyter can help a Data Science professional.

4 Mention the most important tools for Data Science. How exactly do they contribute to each Data Science step?

5 Why is understanding statistics a fundamental skill for a data scientist? Can you think of an example involving data analysis?

6 Python is a versatile programming language. Is it enough for data science projects?

7 On the internet, find three Python libraries that are very popular among data scientists. Briefly explain why they are popular.

8 Compare and contrast an IDE and Jupyter Notebook. What are the main ways they differ?

9 You are learning to become a data scientist and have mastered Python coding. What other tools will you need for your data science toolkit?

10 In this lesson there is a list of professions related to data science. Which one would you prefer to follow and why? What challenges do you think you would face in this profession?

11 Visit an online data science community and search for a simple self-study training course to enhance your knowledge of data science. Evaluate how appropriate the course is to your level of knowledge.