Marvel: Infinity War or How to collect data for your project in a couple of minutes

I have two areas of interest. The first is the society of anonymous ~~lazy~~ data analysts; the second is the society of anonymous geeks. With the second everything is fine, but with the first things are more complicated. When you tell people what tasks data analysts solve, what do they picture? As an experiment, I typed the definition into Google, and here is the first result:

A data analyst is a universal specialist with knowledge of mathematics, statistics, computer science, business, and economics. A Big Data analyst examines large data sets containing fragmented information, for example research results, market trends, customer preferences, etc. Research and analysis of such information can lead to new scientific discoveries, improved company performance, new income opportunities, improved customer service, etc. The main skill of data experts is to see the logical connections in the collected information and, on that basis, to develop business solutions and models.

Definition from buduguru.org/profession/39.

A universal specialist, okay. Judging by the description, something between Dr. Manhattan and Stephen Hawking.

However, I will not go into the semantics of this definition. I want to talk about a sore subject for data analysts (no, not the one where they ~~whine~~ talk about the lack of data). What if the data does exist?

That brings us to the problems that follow:

What tools should be used to study this data?
How to convert these data arrays?
How to store them? Do I need to store them?
What if there are a lot of sources, and they are all heterogeneous?

Okay. We have a pool of problems, but what do we do next? In this article I will talk about the tool our development team has built: the iDVP.Data SaaS cloud system.

What is it?

iDVP.Data SaaS is a multifunctional tool for working with data in the cloud. It lets you connect various data sources, transform the data, and serve it to external systems as web services.

Infinity War

Here we involuntarily cross into my second area of interest: as an example, I decided to connect Marvel's open data to iDVP.Data SaaS. Has everyone seen the new Infinity War movie? After watching the film, I could not help recalling other large-scale conflicts in the Marvel universe that brought global changes to the franchise. I was curious how many characters took part in the Infinity War comic line, and how many of them died in it. To answer these questions, I turned to the most reliable source: the official Marvel site.

First of all, go to the iDVP.Data SaaS website and register.

After that we land on the user's work page, which contains workspaces with demo cases. They show complete data flows, from connecting a data source to a data mart.

Having studied the test examples and added a new workspace, we proceed to create our own data flow. As sources, I chose the following data:

REST service that returns information about all the characters in the Marvel universe;
REST service that returns information about all the events of the Marvel universe;
a CSV file containing the main participants in the civil war.
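
For readers who want to look at the two REST sources outside iDVP.Data SaaS, here is a minimal sketch of how a request URL could be built. It assumes the sources are the public Marvel developer API at gateway.marvel.com and its documented signing scheme (an MD5 hash of the timestamp, private key, and public key); the article itself does not name the endpoints, and the keys below are placeholders.

```python
import hashlib
from urllib.parse import urlencode

# Assumed endpoint of the public Marvel API; the article does not name it.
BASE_URL = "https://gateway.marvel.com/v1/public"


def build_marvel_url(resource, public_key, private_key, ts, limit=100, offset=0):
    """Build a signed request URL for a resource such as 'characters' or 'events'."""
    # Server-side calls are signed with md5(ts + private key + public key).
    signature = f"{ts}{private_key}{public_key}".encode("utf-8")
    digest = hashlib.md5(signature).hexdigest()
    query = urlencode({
        "ts": ts,
        "apikey": public_key,
        "hash": digest,
        "limit": limit,    # page size
        "offset": offset,  # number of records to skip
    })
    return f"{BASE_URL}/{resource}?{query}"


# Placeholder keys, for illustration only.
characters_url = build_marvel_url("characters", "PUBLIC_KEY", "PRIVATE_KEY", ts="1")
events_url = build_marvel_url("events", "PUBLIC_KEY", "PRIVATE_KEY", ts="1")
```

The function only builds the URL and does not send a request, so it can be tried without real keys. Fetching all characters would mean repeating the call with a growing offset.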

Step 1. Connect

Connect the data sources one by one:

As a result, we get three connected data sources:

characters_marvel_raw - a service;
events_marvel_raw - a service;
participants_marvel_raw - a CSV file.

Step 2. Convert

After connecting the data, we create datasets, where we perform the necessary transformations (data cleansing, calculations or, for example, parsing data from JSON) using SQL scripts.

select k.id,
       k.name,
       k.com.name   as comics_name,
       k.ser.name   as series_name,
       k.stor.name  as stories_name,
       k.event.name as events_name
from (
  select a.id,
         a.name,
         flatten(a.comics)  as com,
         flatten(a.series)  as ser,
         flatten(a.stories) as stor,
         flatten(a.events)  as event
  from (
    select c.`data`.id              as id,
           c.`data`.name            as name,
           c.`data`.comics.`items`  as comics,
           c.`data`.series.`items`  as series,
           c.`data`.stories.`items` as stories,
           c.`data`.events.`items`  as events
    from (
      select t.res.`data`.`results` as `data`
      from (
        select convert_from(a.content, 'JSON') res
        from `characters_marvel_raw` a
      ) t
    ) c
  ) a
) k
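
To make the shape of the result easier to picture, here is a rough Python sketch of what the query above does: parse the raw JSON, walk `data.results`, and unroll the nested `items` lists into flat rows. It assumes that several `flatten` calls in one select produce every combination of the unrolled values, and the sample record is made up for illustration.

```python
import json
from itertools import product


def flatten_characters(content):
    """Turn one raw API response (a JSON string) into flat rows."""
    results = json.loads(content)["data"]["results"]
    rows = []
    for ch in results:
        # One list of item names per nested collection, in the query's column order.
        groups = [
            [item["name"] for item in ch[key]["items"]]
            for key in ("comics", "series", "stories", "events")
        ]
        # Assumption: several flatten() calls in one SELECT yield every combination.
        for comics, series, stories, events in product(*groups):
            rows.append({
                "id": ch["id"],
                "name": ch["name"],
                "comics_name": comics,
                "series_name": series,
                "stories_name": stories,
                "events_name": events,
            })
    return rows


# Made-up sample shaped like the response the query expects.
sample = json.dumps({"data": {"results": [{
    "id": 1,
    "name": "Example Hero",
    "comics": {"items": [{"name": "Comic A"}, {"name": "Comic B"}]},
    "series": {"items": [{"name": "Series A"}]},
    "stories": {"items": [{"name": "Story A"}]},
    "events": {"items": [{"name": "Event A"}]},
}]}})

rows = flatten_characters(sample)
```

With two comics and one item in each of the other lists, the sample character becomes two rows. Under the same assumption, a character with an empty list in any of the four collections produces no rows at all.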

As a result, we get parsed data:

And the following chain of data flows:

After connecting and converting the data, access to the resulting information may still be slow (because the source responds slowly or because the volume of data is large). This is where the mechanism for "materializing" (saving) the data inside iDVP.Data SaaS itself comes in. Access to stored data is very fast, even for large volumes of information, thanks to the use of Big Data technologies. Stored data can be refreshed at any time (in whole or in part), and you can also set up a schedule by which the system refreshes it automatically.

Thus, it is possible to accumulate historical data even if the source itself does not support this. Because the data is stored in the iDVP.Data SaaS file system, materialization also lets you keep working with it if the source becomes unavailable.
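
The idea of materialization can be sketched in a few lines of Python. This is only an illustration of the principle (keep a stored copy, refresh it when it is stale, fall back to it when the source fails), not a description of how iDVP.Data SaaS implements it; the class, its parameters, and the sample source are all hypothetical.

```python
import time


class MaterializedDataset:
    """Keeps the last successfully loaded rows and refreshes them on demand."""

    def __init__(self, loader, max_age_seconds):
        self._loader = loader             # callable that queries the slow source
        self._max_age = max_age_seconds   # stands in for the refresh schedule
        self._rows = None
        self._loaded_at = None

    def refresh(self):
        """Reload from the source; keep the old copy if the source fails."""
        try:
            self._rows = self._loader()
            self._loaded_at = time.monotonic()
        except Exception:
            if self._rows is None:
                raise  # nothing stored yet, so the failure cannot be hidden

    def read(self):
        """Serve the stored copy, refreshing first when it is missing or stale."""
        stale = (
            self._loaded_at is None
            or time.monotonic() - self._loaded_at > self._max_age
        )
        if stale:
            self.refresh()
        return self._rows


# Hypothetical source that answers once and is unavailable afterwards.
calls = {"n": 0}


def flaky_source():
    calls["n"] += 1
    if calls["n"] > 1:
        raise ConnectionError("source unavailable")
    return [{"id": 1, "name": "Example Hero"}]


dataset = MaterializedDataset(flaky_source, max_age_seconds=3600)
first = dataset.read()    # loads from the source and stores the rows
dataset.refresh()         # the source fails, the stored copy is kept
second = dataset.read()   # still served from the stored copy
```

Here the second read returns the same rows as the first even though the source has stopped responding in between.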

Step 3. Publish

We create a data mart (a web service), which is also an SQL query. In the data mart you can define input and output parameters.

Once the data marts are created, they can be shared publicly and used in your external systems.

The resulting service can be used to build reports and 3D applications, as we did, for example, to visualize the data of Election 2018.
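
As a rough analogy, a data mart behaves like a parameterized SQL query wrapped in a function: the input parameter goes into the query, and the selected columns are the output. The sketch below uses Python's built-in sqlite3 module rather than iDVP.Data SaaS, and the table, column names, and rows are made up purely to make it runnable.

```python
import sqlite3


def participants_by_status(conn, event_name):
    """Data-mart-style query: input is an event name, output is counts per status."""
    sql = """
        SELECT status, COUNT(*) AS characters
        FROM participants
        WHERE event_name = :event_name
        GROUP BY status
        ORDER BY status
    """
    return conn.execute(sql, {"event_name": event_name}).fetchall()


# Hypothetical table and made-up rows, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participants (name TEXT, event_name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO participants VALUES (?, ?, ?)",
    [
        ("Hero A", "Event X", "survived"),
        ("Hero B", "Event X", "died"),
        ("Hero C", "Event X", "survived"),
        ("Hero D", "Event Y", "died"),
    ],
)

result = participants_by_status(conn, "Event X")
```

Binding the input as a named parameter, instead of pasting it into the SQL string, keeps the query safe when the value comes from an external caller.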

P.S. Conclusions

First conclusion

We recalled the comic line that features the Infinity War, and this is what we got:

57 characters participated;
5 not confirmed;
15 died.

Second conclusion

If you need to get to grips with data quickly and easily, you can use the iDVP.Data SaaS system, which is currently in beta testing. Our team hopes that among those who have read this story to the end there are some who will become the first testers of our new tool.

With it, you can do the following yourself:

connect to various sources;
consistently receive data from any sources;
perform ETL data transformations using SQL;
increase the speed of working with data using BigData technologies;
analyze data;
provide data to external systems;
perform these operations in a convenient and simple interface.

Thanks in advance for your feedback!

An example of use, based on this post's comments:

Statistics on the comments.

Source: https://habr.com/ru/post/412579/
