2. Data Collection and Validation

In this unit, students will obtain basic knowledge about data collection and validation. More specifically, students will learn what data collection is, as well as the different types of data and data sources. Special mention will be made of the topic of data coding, focusing on its advantages and disadvantages. Finally, students will learn about the data validation procedure, focusing on its respective types.

Learning Objectives
In this unit you will learn to:
> Define what data collection is.
> Classify data sources.
> Describe the attributes of information quality.
> Understand the concept of open data platforms.
> Recognize the importance of legal permissions for data collection.
> Identify the different data types.
> Define what data coding is.
> Understand the process of data validation.
> Categorize the data entry validation types.

Ministry of Education 2024-1446

1: Data Collection


Lesson 1: Data Collection

Link to digital lesson: www.ien.edu.sa

Data Collection
The most important stage of research is data collection: the process of collecting facts, numbers and words relating to the target variables. Data collection can be carried out using various devices such as sensors and data recorders. It requires a deep understanding of the parameters under study, as well as planning and diligent work, in order to obtain good quality data. Good quality data makes proper analysis possible, so that tasks can be performed effectively and meaningful information about the phenomenon under study can be extracted. Data collection methods vary depending on the type of data, but verifying each stage of data collection accurately and truthfully always remains important.

Data Collection: The process of gathering and measuring data, including data acquisition, data labeling, and data improvement.

Figure 2.1: Engineer collecting weather data

Example
Knowing the weather is one of the most important considerations when traveling. Several devices can be used to measure weather-related factors, including temperature sensors, anemometers and hygrometers. The data collected from these devices are temperature values, wind speed values and the concentration of water vapor in the air.
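The weather example can be sketched in Python to show one possible way of storing collected readings. This is only an illustration: the class name WeatherReading, the field names, the units and all values are assumptions made for this sketch and do not come from real measurements.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeatherReading:
    """One set of values collected from the three devices (hypothetical layout)."""
    temperature_c: float  # from a temperature sensor
    wind_speed: float     # from an anemometer (unit is an assumption)
    humidity: float       # from a hygrometer: water vapor concentration


def average_temperature(readings):
    """Return the mean of the collected temperature values."""
    return mean(r.temperature_c for r in readings)


# Made-up readings, used only to show how the record is filled in.
readings = [
    WeatherReading(temperature_c=28.0, wind_speed=12.0, humidity=40.0),
    WeatherReading(temperature_c=30.0, wind_speed=8.0, humidity=35.0),
]

print(average_temperature(readings))
```

Keeping each reading in one record makes it easier to check later that no value is missing before the data are analyzed.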


Sources of Data
There are two main classifications of data sources: primary data sources and secondary data sources.

Primary Data Source
A primary data source contains data that has been collected directly and has not yet been processed. It can be collected from sensors, data recorders or even questionnaires. For example, a temperature sensor that collects air temperature data is a primary data source. Another example is a wind speed sensor. A questionnaire given to clients about the weather they prefer for trips abroad is also a primary data source.

The anemometer is a wind speed sensor. The wind drives the three wind cups at the top to rotate, and the central axis drives an electric generator. The output of the generator operates an electric meter that is calibrated in wind speed.

Figure 2.2: Anemometer

Secondary Data Source
This type of data is generated when we use primary data sources to produce other data. For example, we can use air temperature and wind speed data from two different sensors to produce data for another parameter, called the wind-chill temperature. The wind-chill temperature can be found by multiplying the wind speed by 0.7 and then subtracting that value from the air temperature (wind-chill formula). In other words, we first use the temperature and wind sensors as primary data sources to collect temperature and wind speed data, and then a researcher can use the wind-chill formula as a secondary data source to obtain wind-chill temperature data.

Figure 2.3: Weather forecast website

Table 2.1: Differences between primary and secondary data sources

Attribute | Primary data sources | Secondary data sources
Originality | Collected directly from the original sources. | Not original data, because someone has already collected them.
Form | Are in raw and unorganized form. | Are in organized and processed form.
Accuracy | More accurate, as it is current data. | Less accurate, as it relates to the past.
Source | Collected through sensors, questionnaires, interviews, experiments, etc. | Collected from books, journals, documents, web pages, blogs, etc.
Cost effectiveness | Expensive and more time consuming. | Less expensive and less time consuming.
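The wind-chill rule described in the Secondary Data Source section can be sketched in Python. The function name wind_chill and the sample values are assumptions made for illustration; the lesson does not state the unit of the wind speed, so the sketch leaves it unspecified.

```python
def wind_chill(air_temperature, wind_speed):
    """Apply the lesson's rule: air temperature minus 0.7 times the wind speed."""
    return air_temperature - 0.7 * wind_speed


# Primary data: made-up values standing in for two sensor readings.
air_temperature = 10.0
wind_speed = 10.0

# Secondary data: a new value produced from the two primary values.
print(wind_chill(air_temperature, wind_speed))
```

The two inputs play the role of primary data, and the returned value is the secondary data derived from them.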


Internal and External Data Sources
Data sources can be categorized as internal or external. Internal data are data under the control of the business, while external data are any data generated outside the business. For example, data collected from a sensor that belongs to a university or a science institution is internal data for that institution, while data collected from other institutions, from individuals or from any source outside that university is external data with respect to the university.

Information Quality
When data are processed, organized, structured or presented in a given context so as to make them useful, they are called information. The value of information for a given use is called "Information Quality". It is an important factor that expresses the extent to which information can be used in making decisions. As more data are collected and preserved, the quality of the information produced by processing them has become increasingly important. Ensuring the quality of information helps to determine project requirements accurately, to direct services effectively, and to increase day-to-day efficiency. In contrast, inaccurate information can cause business disruptions, reduce efficiency, and delay the completion of projects. The quality of information can be checked against specific criteria called quality attributes, which are shown in the following figure:

> Accuracy
> Appropriateness
> Timeliness
> Completeness
> Level of Detail

Figure 2.4: Attributes of Information Quality

Here are some questions that can help you check if information is accurate:
> Can facts, statistics or other information be verified by other sources?
> Can the experiment be replicated, and does it give the same results?
> Where is the information from?
> Why was the information generated?
> Based on your knowledge, does the information seem accurate?
> Does the information contain misspelled words or misplaced characters, and are the quotations cited correctly?



Before the collection of any kind of information through

Important points to check for the timeliness of information are the following:
> Check the dates of the sources used.
> Check the history of keywords relating to intellectual property rights, such as registered trademarks, copyrights, patents, and trade secrets.
> Check the history of revisions or editing of the information.
> Check the date of publication.

The MSN Weather website is a great example of an information source that meets the five attributes of information quality described above.

Figure 2.5: Example of an information source

Open Data Platforms
Open data platforms are platforms that help users access collections of open data. A typical open data platform presents the data of the organization that hosts it. State governments or non-profit organizations host open data platforms that give the general public access to data. More specifically, they continuously collect and organize data from a variety of public sectors. These datasets can be used without any financial or technical constraints. Open data can be reused and redistributed, as long as the requirements of the data license are respected. They can also be used by citizens of other countries. Enterprises may also provide open data through their corporate social responsibility programs.

Some of the common uses of open data platforms include:
> Transparency for government budgeting and spending on state services.
> Performance statistics for government agencies.
> Data from various public sectors, e.g. education, healthcare or transportation, which can be used for research that provides insights into the functioning of the country.
> Integration of the datasets into other applications.

In Saudi Arabia, the Open Data Platform can be found at the address: https://open.data.gov.sa

Figure 2.6: Open Data Platform
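As an illustration of how an open dataset might be integrated into another application, the following Python sketch reads a small dataset in CSV form and adds up the values for each public sector. It assumes the dataset has already been downloaded as CSV text. The column names (sector, year, value) and all numbers are made up for this sketch and do not come from any real open data platform.

```python
import csv
import io

# Made-up sample standing in for a downloaded open dataset (hypothetical columns).
SAMPLE_CSV = """sector,year,value
education,2022,120
healthcare,2022,95
education,2023,130
"""


def total_by_sector(csv_text):
    """Sum the 'value' column for each entry in the 'sector' column."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sector = row["sector"]
        totals[sector] = totals.get(sector, 0) + int(row["value"])
    return totals


print(total_by_sector(SAMPLE_CSV))
```

Before reusing a real dataset in this way, the terms of its data license should be checked.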


Data Privacy
Any data that relates to a person and can identify him or her is called personal data. For example, a first name and surname, a telephone number and an identity number are all personal data. Because so many people now communicate online, there are many dangers, and it is important to protect ourselves. Data privacy is the ability of a person to determine for themselves when, how, and to what extent personal information about them is shared with or communicated to others.

Legal Permissions to Collect and Use Data
Collecting and using data for a research project requires legal permissions. For this reason, an Institutional Review Board (IRB) reviews proposals before a research project begins to determine whether it follows ethical principles and legal regulations. The legal permissions can vary depending on numerous factors. The two main factors to take into consideration are the location in which the data is stored and the location of the end users that consume it. Companies and organizations need to ensure that the services that collect and consume data comply with the laws of their respective countries.

Example
Data hosted on the Open Data Platform of Saudi Arabia must be used by visitors in accordance with the terms of the Open Data License (https://open.data.gov.sa/en/pages/policies/license).

Figure 2.7: Saudi open data policies

Targeted Research and Data Comparison
Targeted research is used when we want to focus on specific issues that have emerged from our primary research. For example, suppose we used temperature and wind values to predict the weather in a city and then observed that specific areas of the city recorded extreme temperature values. We would then conduct targeted research into these areas to assess what other parameters, apart from temperature, are affecting them.

Data comparison is carried out when we have more than one dataset with data recorded in the same area and in similar time periods. For example, we may have a dataset of temperature values recorded for the city of Jeddah in March 2021 and another recorded in March 2022. With these two datasets, we can easily perform data comparison to detect temperature variations or changes through the years.
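The data comparison described in this lesson can be sketched in Python by comparing the average temperature of two datasets. The temperature values below are made up for illustration and are not real measurements for Jeddah; the function name mean_change is also an assumption of this sketch.

```python
from statistics import mean

# Made-up temperature values (degrees Celsius) for the same city and month.
march_2021 = [29.0, 30.5, 31.0, 28.5]
march_2022 = [30.0, 31.5, 32.0, 29.5]


def mean_change(earlier, later):
    """Return how much the average value changed from the earlier dataset to the later one."""
    return mean(later) - mean(earlier)


print(mean_change(march_2021, march_2022))
```

A positive result means the later period was warmer on average, and a negative result means it was cooler. The comparison is only meaningful when both datasets cover the same area and similar time periods.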



Exercises

1 Read the sentences and tick ✓ True or False.
1. Data collection is the process of gathering and measuring data.
2. There are two main classifications of data collection sources: primary and secondary.
3. The date that the information was published is an important parameter of information quality.
4. Appropriateness means that the more irrelevant the information is to what is being searched for, the worse its quality.
5. Levels of detail and accuracy are considered quality standards of information.
6. The five quality attributes help us check the reliability of information.
7. The Government has no authority on open data platforms.
8. The legal permissions to collect and use data can vary depending on numerous factors.
9. Targeted research is used when we want to focus on specific issues that have emerged from our primary research.
10. Data comparison can be done when we have more than one dataset with registered data from the same area and from similar time periods.


2 Briefly explain what primary and secondary data sources are.

3 Briefly describe each quality attribute which can be used to check the quality of information.

4 Give an example of targeted research and data comparison.


5 Give examples and compare the primary and secondary weather data sources.

6 Visit the https://open.data.gov.sa open data platform and search for the information on the data usage permissions. Are there any exceptions?

7 Search on the internet for open data platforms in other countries. Can you find personal information on these platforms?


8 Select two websites on the internet, one state and one private. Compare the quality of the information between them based on the five attributes.
