
My name is Alexey Chumagin, and I am a tester at Provectus. In this article I will explain how data quality requirements are formed and what levels of data testing exist.
Upd:
The article deals with data, large or not, whose analysis and aggregation drive various processes: patterns are derived from it for use in further analysis or for decision-making. The data may be collected from scratch for a specific project, or it may come from databases previously assembled for other projects or for commercial purposes. The sources of this data are diverse and include not only manual entry by operators but also automated and/or automatic measurements, stored in databases either systematically or unsystematically (in a heap, "we'll figure out what to do with it later").
end-of-upd.
Why data testing is important
Data plays an increasingly important role in decision-making, both in everyday life and in business. Modern technologies and algorithms allow us to process and store huge amounts of data, transforming it into useful information.
What kind of data is this? For example, your browser history, the transactions on your card, the movement points of a device. The data is impersonal, but it still belongs to a specific device. If you collect and process it, you can learn quite interesting things about the owner of that device: for example, where they like to go, their gender and age. Gradually, we "humanize" the device and give it certain characteristics.
This information can then be used for targeted advertising. If you are a woman, we can say with a high degree of probability that you are not interested in ads for men's shavers; you should be shown ads related to your interests. The quality of ad targeting can be improved thanks to what is known about the devices on which the ads are shown. You see the ads you want to see, so you click on them. The people who show you the ad get paid for it, and the advertiser profits because you learn about their product.
All of this is built on data owned by various companies and people. To use this data effectively, it must be reliable, and we must know that a given set of transactions really belongs to a given account.
As the volume of data grows, storing it demands considerable resources. Data cleansing is a separate task that needs to be addressed: we want to store only the data we really need, and we do not want our database to contain duplicates or records that fail our criteria, such as entries with empty fields. Hence there are requirements for data quality, and with them the question of how to test it.
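As a toy illustration (a sketch in pandas, with made-up column names), basic cleansing of duplicates and empty fields could look like this:

```python
import pandas as pd

# A hypothetical table of records; the column names are invented for this example.
df = pd.DataFrame({
    "name":  ["Anna", "Boris", "Anna", None],
    "phone": ["+71234567890", "+79876543210", "+71234567890", "+70000000000"],
})

df = df.drop_duplicates()   # we do not want to pay to store the same record twice
df = df.dropna()            # drop records that fail our criteria, e.g. empty fields

print(df)  # only the records we really need remain
```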
What is quality
I like this definition: product quality is a measure of user satisfaction. Clearly, everything depends on the context in which the product is used. If you use a well-known product such as Facebook or Skype, you have one set of quality requirements: you will put up with some errors and continue using the product anyway. But if you commissioned a program and paid money for it, the quality requirements will be higher: you will find fault with every little thing. Different people have different ideas about quality, and different programs have their own quality requirements.
Therefore, before development and testing begin, people usually determine what they will consider a quality product. All of this can be described formally: for example, we will consider our product to be of high quality if it contains no critical errors, or if it runs for two weeks without a failure.
Determining these requirements is not an easy task. Typically, software requirements come from the business, and if we ask the business what the data should be like, we may well hear that the data should be good and clean. The tester's task is to find out or clarify what the data actually is and by what criteria we judge its quality and cleanliness. These criteria need to be formalized, written down, and made measurable.
How data quality requirements are formed
The tester begins by finding out what is unclear to them and what they would like to know about the object under test. The tester makes a list of questions and conducts an "interview" with the customer, who, in theory, should know what the data should look like. For example, I ask whether empty cells or duplicate rows are acceptable.
An example requirement: if we have a list of people, then a first name, last name, or middle name may repeat, but an entire row must not. Repetition may be allowed for a single cell, but not for a whole row or for a combination of several cells; a full match must never occur. A sketch of such a check follows below.
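As a minimal sketch of this requirement (the table and column names are invented for the example), such a check could look like this in pandas:

```python
import pandas as pd

# A toy list of people: individual cells repeat, but no full row does.
people = pd.DataFrame({
    "first_name":  ["Ivan", "Ivan", "Petr"],
    "last_name":   ["Petrov", "Petrov", "Sidorov"],
    "middle_name": ["Ivanovich", "Sergeevich", "Ivanovich"],
})

# Requirement: a full match across the entire row must never occur.
full_duplicates = people[people.duplicated(keep=False)]
assert full_duplicates.empty, f"Full-row duplicates found:\n{full_duplicates}"

# The same idea for a combination of several cells, if the requirements demand it.
pair_duplicates = people[people.duplicated(subset=["first_name", "last_name"], keep=False)]
print(pair_duplicates)  # rows 0 and 1 share first and last name, which may be allowed
```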
Next, we ask about the format of the data in each particular cell. For example, a telephone number should contain 12 digits, and a bank card number 16. We may also have a criterion that not every sequence of that many digits is a valid card number. Or we understand that a surname can contain only letters. There can be many questions about data format. In this way we find out everything we need to know about the subject of testing.
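Such format requirements translate naturally into code. Here is a minimal sketch; the regular expressions are made up for the example, and the Luhn checksum is one possible way to formalize the criterion that not every digit sequence is a card number:

```python
import re

# Hypothetical format requirements gathered from the "interview" with the customer.
PHONE_RE   = re.compile(r"\d{12}")               # exactly 12 digits
SURNAME_RE = re.compile(r"[A-Za-zА-Яа-яЁё-]+")   # letters only (plus hyphen)

def luhn_valid(card_number: str) -> bool:
    """Not every 16-digit sequence is a card number: real ones pass the Luhn checksum."""
    if not re.fullmatch(r"\d{16}", card_number):
        return False
    digits = [int(d) for d in card_number]
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i in range(len(digits) - 2, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return sum(digits) % 10 == 0

assert PHONE_RE.fullmatch("791234567890")
assert SURNAME_RE.fullmatch("Petrov-Vodkin")
assert luhn_valid("4111111111111111")        # a classic Luhn-valid test number
assert not luhn_valid("1234567812345678")    # 16 digits, but not a valid card number
```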
What is quality data?
Quality data must have several characteristics; a sketch after this list shows how some of them can be checked.
- Completeness: there are no gaps in the records; all cells must be filled. The data should carry as much information as possible.
- Uniqueness: the data must not contain identical records.
- Reliability: this is what it is all for; no one wants to work with data that cannot be trusted. In quality data, table cells contain what they are supposed to contain: an IP address, a telephone number, and so on.
- Accuracy: for numeric data, there must be an exact number of characters, for example, 12 decimal places. The data should also be close to some expected average value.
- Consistency: the data must retain its values regardless of how it is measured.
- Timeliness: the data must be up to date, especially if it is updated periodically. For example, the amount of data should grow each month. Data should not be outdated: if we are talking about banking transactions, we are interested in, say, the last six months.
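As a minimal sketch (assuming a hypothetical transactions table with made-up column names), some of these characteristics translate directly into checks:

```python
import pandas as pd

# Hypothetical transactions table; column names are invented for the example.
tx = pd.DataFrame({
    "account":   ["A1", "A2", "A3"],
    "amount":    [100.0, 250.5, 80.0],
    "timestamp": pd.to_datetime(["2020-01-10", "2020-03-02", "2020-05-20"]),
})

# Completeness: all cells must be filled.
assert not tx.isna().any().any(), "Found empty cells"

# Uniqueness: no identical records.
assert not tx.duplicated().any(), "Found duplicate records"

# Timeliness: we are interested in, say, the last six months of transactions.
cutoff = tx["timestamp"].max() - pd.DateOffset(months=6)
print(f"{(tx['timestamp'] < cutoff).sum()} records older than six months")
```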
Data test levels
We can group data checks into so-called layers; the testing pyramid is a good analogy here. The pyramid describes how the number of tests is distributed across the different levels of an application. (A sketch illustrating all three layers follows the list below.)
- The unit layer is where a single module of the program is tested, most often one function or method; tests of this kind should be the most numerous. A unit test for data is where we define requirements for each individual cell. There is no point in testing further if we have errors at the cell level: if, for example, a surname contains digits, what is the point of checking anything else? Perhaps those digits should have been similar-looking letters. First everything must be fixed, and only then can we check the next level, for instance, that every record exists in a single copy and there are no duplicates, if that is what the requirements say.
- The integration layer is where several parts of a program are tested together. For data, this layer is about the entire table. Suppose duplicates are allowed, but no more than a hundred of them. Or take addresses: even in a city of over a million people, a million of them cannot live on the same street. So if we filter by street, the number of addresses should be on the order of a thousand or ten thousand (the exact threshold must be defined), and if we get a million, something is wrong with the data.
- The system layer is where the entire program is tested as a whole. For data, this layer means that the whole dataset is tested as a system, including its statistics. For example, we may state that no more than 30% of the records can be men born after 1985, or that 80% of the data must be of the same type.
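To make the three layers concrete, here is a minimal sketch of pytest-style checks over a hypothetical table of people (all column names and thresholds are invented for the example):

```python
import pandas as pd

# Hypothetical table of people; columns and thresholds are invented for the example.
people = pd.DataFrame({
    "surname":    ["Petrov", "Sidorova", "Ivanov"],
    "gender":     ["m", "f", "m"],
    "birth_year": [1980, 1992, 1975],
})

def test_unit_layer():
    # Unit layer: requirements for each individual cell,
    # e.g. a surname contains only letters.
    assert people["surname"].str.fullmatch(r"[A-Za-zА-Яа-яЁё-]+").all()

def test_integration_layer():
    # Integration layer: requirements for the table as a whole,
    # e.g. duplicates are allowed, but no more than a hundred.
    assert people.duplicated().sum() <= 100

def test_system_layer():
    # System layer: statistical requirements over the entire dataset,
    # e.g. no more than 30% of records are men born after 1985.
    share = ((people["gender"] == "m") & (people["birth_year"] > 1985)).mean()
    assert share <= 0.30
```

Run with pytest, each layer fails independently, which mirrors the advice above: fix cell-level errors first, then move up.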
In conclusion, I will say that data testing is an area that offers many opportunities for creativity and growth. There is no silver bullet here: different approaches can be used to test data, and the truth, as always, lies somewhere in between.