Posts

What is Data Lake?

A data lake is a storage repository that can hold large amounts of structured, semi-structured, and unstructured data. It stores every type of data in its native format, with no fixed limits on account or file size. It holds data in large quantities, which is meant to improve analytic performance and native integration. A data lake is like a large container, much like a real lake fed by rivers: just as a lake has multiple tributaries flowing in, a data lake has structured data, unstructured data, machine-to-machine data, and logs flowing in in real time. The data lake democratizes data and is a cost-effective way to store all of an organization's data for later processing. Research analysts can focus on finding meaningful patterns in the data rather than on the data itself. Unlike a hierarchical data warehouse, where data is stored in files and folders, a data lake has a flat architecture. Every data element in a data lake is given a unique identifier and tagged with a set of metadata inf
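The flat layout described above (one namespace, a unique identifier per element, descriptive metadata tags) can be illustrated with a minimal in-memory sketch. The names `FlatDataLake`, `LakeObject`, `put`, and `find` are hypothetical and chosen only for illustration; a real data lake would sit on an object store rather than a Python dictionary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class LakeObject:
    """One item in the lake: raw bytes plus descriptive metadata tags."""
    object_id: str
    payload: bytes
    metadata: dict = field(default_factory=dict)


class FlatDataLake:
    """Hypothetical in-memory stand-in for a flat object store."""

    def __init__(self):
        # A single flat namespace: no folders, just id -> object.
        self._objects = {}

    def put(self, payload: bytes, **metadata) -> str:
        """Store data in its native format and return its unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = LakeObject(object_id, payload, dict(metadata))
        return object_id

    def get(self, object_id: str) -> LakeObject:
        """Look an object up directly by its identifier."""
        return self._objects[object_id]

    def find(self, **tags) -> list:
        """Return every object whose metadata matches all the given tags."""
        return [
            obj for obj in self._objects.values()
            if all(obj.metadata.get(key) == value for key, value in tags.items())
        ]


# Structured, log, and semi-structured data land side by side, unchanged.
lake = FlatDataLake()
csv_id = lake.put(b"id,name\n1,Ada\n", source="crm", format="csv")
log_id = lake.put(b"GET /index.html 200", source="web", format="log")
json_id = lake.put(b'{"sensor": 7, "temp": 21.5}', source="iot", format="json")

# Discovery goes through metadata tags, not a directory path.
web_objects = lake.find(source="web")
```

Because nothing is organized by path, the metadata is the only way to locate data without already knowing its identifier, which is why tagging at ingestion time matters in this design.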

Building a Data Lake: Step by Step

Published on: October 29, 2019. Linda Feng, Software Architect. At Unicon, a common challenge that we hear about is this: “We have lots of data everywhere, but we want a way to bring the data into one place so that we can analyze it and use it to ‘see’ how we are doing, possibly to spot trends and ultimately to inform decision making.” Many universities and school districts today are in various stages of implementing systems to enable data collection for useful analytics. A common problem is that they want to collect different types of data and combine them in meaningful ways. And while many more factors contribute to the success of transformative uses of data on campuses, one hurdle that IT administrators face is how to get their “data house” in order. During the last few years, I’ve spent most of my time helping customers assemble a variety of data sources into a data lake. What we have seen work best is to first think through,

Building Serverless Data Lake Pipeline on AWS

Today I’d like to talk about building a serverless data lake on AWS. I am writing this post to share my thinking, to get feedback on my prototype and vision, and to share experiences that may be of interest to data engineering practitioners and others. General Data Lake Pipeline: I’d like to start with what a modern data lake pipeline looks like on AWS. (Figure: Data lake pipeline.) Generate: The first stage is generation, that is, the data sources themselves. In traditional applications, data typically comes from legacy transaction systems, ERP systems, and web logs, and increasingly from capturing information about consumers as they hit the website and from sensor networks feeding data into the pipeline. Collection: The next stage is collection, where you might see polling services running on EC2 that go out to enterprise systems and poll data from file systems or databases. Mode
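As a rough illustration of the polling style of collection mentioned above, here is a minimal sketch in which local directories stand in for the enterprise file system and for the lake's landing area. The function name `poll_once` and the directory layout are assumptions made for this example, not part of the original post, and a real collector would write to an object store and run on a schedule.

```python
import shutil
import tempfile
from pathlib import Path


def poll_once(source_dir: Path, landing_dir: Path, seen: set) -> list:
    """Copy files that were not collected before into the landing area."""
    collected = []
    for path in sorted(source_dir.iterdir()):
        if path.is_file() and path.name not in seen:
            # Copy the file unchanged so it keeps its native format.
            shutil.copy2(path, landing_dir / path.name)
            seen.add(path.name)
            collected.append(path.name)
    return collected


# Demo: temporary directories stand in for the real source and landing zone.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    landing = Path(tmp) / "landing"
    source.mkdir()
    landing.mkdir()
    seen_files = set()

    (source / "orders.csv").write_text("id,total\n1,9.99\n")
    first_poll = poll_once(source, landing, seen_files)

    (source / "web.log").write_text("GET / 200\n")
    second_poll = poll_once(source, landing, seen_files)

    # Nothing new has arrived, so this poll collects nothing.
    third_poll = poll_once(source, landing, seen_files)
    landed = sorted(p.name for p in landing.iterdir())
```

Tracking already-collected names in a set keeps each poll idempotent in this sketch; a longer-running collector would need to persist that state and also detect files that changed after being collected.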