Wednesday, February 09, 2022

Data Warehouse and Data Lake

 Data Warehouse and Data Lake in a nutshell

A Data Warehouse is a central store for large amounts of structured data that may come from various sources.


Such stores are very important to companies as they can be used to deliver insights from across the organisation to support decision making.

On the other hand, a Data Lake is a flexible store for unstructured, semi-structured or structured raw data. The stored data is unprocessed, and structure is usually applied only when the data is retrieved (often called schema-on-read). Note, however, that a Data Lake is not a replacement for a Data Warehouse.
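To make the idea of applying structure at retrieval time more concrete, here is a minimal sketch in Python. The sample records, the `raw_lake` list and the `read_with_schema` helper are all hypothetical and only illustrate the principle; a real Data Lake would sit on object or file storage rather than an in-memory list.

```python
import json

# Raw records land in the "lake" exactly as they arrive (hypothetical sample data).
raw_lake = [
    '{"user": "alice", "amount": "12.50", "ts": "2022-02-09"}',
    '{"user": "bob", "amount": "7", "extra": {"note": "gift"}}',
    'free-form log line that is not JSON at all',
]


def read_with_schema(raw_records):
    """Apply a (user: str, amount: float) schema while reading.

    Records that cannot be parsed or do not fit the schema are skipped,
    but they stay untouched in the raw store.
    """
    rows = []
    for line in raw_records:
        try:
            record = json.loads(line)
            rows.append((str(record["user"]), float(record["amount"])))
        except (ValueError, KeyError, TypeError):
            continue
    return rows


rows = read_with_schema(raw_lake)
print(rows)
```

Nothing was validated or reshaped when the records were stored; the schema lives in the reader, so a different consumer could read the same raw data with a different structure.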

Which to choose

Data Warehouses and Lakes are both used by organisations as centralised data stores that enable different users and organisation units to access and use data to extract insights and perform any sort of analysis. Usually an organisation will need both a Data Lake and a Warehouse to support all the required use-cases and end users.

A data lake is capable of housing data of any form, from structured to unstructured. Additionally, it does not require any pre-processing before the data is stored, as processing can happen once the data is in the data lake. Data Lakes are mostly useful to Data Scientists and Engineers who need access to even unstructured data that will help them build Artificial Intelligence or Machine Learning models. Data Lakes are also generally more cost-efficient than Data Warehouses, as they don’t require data to conform to a particular format or schema before it is stored.

A data warehouse, by contrast, is only capable of storing structured data that is ready to be analysed by specific organisation units in order to unveil business insights. Therefore, ETL (Extract, Transform, Load) processes usually need to be built around the Data Warehouse. ETL extracts data from its sources, transforms it into the expected format and loads it into the warehouse so that users can perform particular tasks over it. For that reason, Data Warehouses are very powerful for business or operations analysts who need access to relational data with a schema that enables them to create reports and support decision making by discovering insights.
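As a rough illustration of an ETL process feeding a schema-bound store, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The `orders` table, the sample rows and the `extract`, `transform` and `load` functions are assumptions made for this example, not a description of any particular warehouse product.

```python
import sqlite3

# Hypothetical raw source rows: (order id, customer, amount as text, country code)
source_rows = [
    ("1", " Alice ", "12.50", "gb"),
    ("2", "Bob", "7", "US"),
    ("3", "Carol", "not-a-number", "de"),
]


def extract():
    """Extract: pull the raw rows from the source system."""
    return list(source_rows)


def transform(rows):
    """Transform: cast and clean each row; reject rows that do not fit the schema."""
    clean = []
    for order_id, customer, amount, country in rows:
        try:
            clean.append((int(order_id), customer.strip(), float(amount), country.upper()))
        except ValueError:
            continue  # row cannot be cast to the expected types
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, "
        "amount REAL NOT NULL, country TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))

# An analyst-style report over ready-for-analysis relational data.
report = conn.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(report)
```

Because the cleaning happens before loading, the analyst's query can rely on every row having the expected types, which is the trade-off against the flexibility of a Data Lake.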

A Final Word

In this article, we discussed the key differences between Data Lakes and Warehouses. Note though that this is not an apples-to-apples comparison. The two support different use-cases and serve different users, and organisations usually require both to operate efficiently.

Data Lakes are more flexible, schema-less stores that are capable of storing unstructured, semi-structured or structured data. They are usually useful to more technical users such as Data Scientists or Engineers. On the other hand, Data Warehouses can only accept relational data, which in turn is more useful to less technical people who need access to ready-for-analysis data.


Hard Disk Drive

A hard drive is non-volatile storage hardware that physically stores all of your computer’s information and data and houses the hard disk. A hard disk is sometimes also referred to as a Hard Disk Drive.

A Hard Disk Drive has platters coated with magnetic material, on which the information is recorded. In contrast to sequentially addressable storage media such as magnetic tape or punched tape, a Hard Disk Drive is a direct access storage device (DASD), as any piece of data can be accessed directly without first passing over the data stored before it.
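The difference between sequential and direct access can be sketched in Python with an in-memory byte stream. This is only an analogy: the `RECORD_SIZE` constant, the fixed-size records and both helper functions are hypothetical, and a real drive addresses sectors rather than Python byte offsets.

```python
import io

RECORD_SIZE = 8  # hypothetical fixed record size in bytes

# 1000 fixed-size records, each holding its own index as zero-padded text.
storage = io.BytesIO(b"".join(f"{i:08d}".encode() for i in range(1000)))


def read_sequential(stream, index):
    """Tape-style access: start at the beginning and pass over every earlier record."""
    stream.seek(0)
    for _ in range(index):
        stream.read(RECORD_SIZE)  # skip an earlier record
    return stream.read(RECORD_SIZE), index + 1  # (record, number of reads)


def read_direct(stream, index):
    """Disk-style access: jump straight to the record's position."""
    stream.seek(index * RECORD_SIZE)
    return stream.read(RECORD_SIZE), 1  # (record, number of reads)


seq_record, seq_reads = read_sequential(storage, 750)
dir_record, dir_reads = read_direct(storage, 750)
print(seq_record, seq_reads)
print(dir_record, dir_reads)
```

Both functions return the same record, but the sequential version has to read every record in front of it, while the direct version needs a single read.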

Structure of Hard Disk Drive

Common form factors (widths) for Hard Disk Drives are 5.25, 3.5, 2.5 and 1.8 inches. Drives as small as 0.85 inch exist, but the 3.5 inch hard drive is the standard in desktop computers. A hard disk consists of one or more rotating disks mounted on an axis, also called the spindle; an electric motor to rotate the disks; moving heads which read and write the data; an IC for controlling the motor and heads; dedicated RAM; a very strong outer casing; and the interfaces for power supply and data exchange.


Function Of Hard Drive

The main function of a hard disk is to store data for the long term; this data can be the computer’s operating system, applications, documents, personal files and so on. The main thing to note is how much storage capacity the hard drive has, which is measured in gigabytes or terabytes. Performance refers to read and write speeds, latency and seek time. Seek time is the time the magnetic heads take to move to the track holding the desired data, and latency is the additional wait for the platter to rotate that data under the head. Read/write speed and data transfer rate measure how much data can be written to the drive, or read from a desired location on it, in a specific amount of time.
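As a back-of-the-envelope sketch, these performance figures can be combined into a rough access-time estimate. The function names and the example numbers (7200 RPM, 9 ms seek, 150 MB/s) are assumptions chosen for illustration, not measurements of any specific drive. The one general fact used is that, on average, the platter has to turn half a revolution before the desired data reaches the head.

```python
def average_rotational_latency_ms(rpm):
    """Average wait for the data to rotate under the head: half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return 0.5 * ms_per_revolution


def access_time_ms(seek_ms, rpm, size_mb, transfer_mb_per_s):
    """Rough estimate: seek time + average rotational latency + transfer time."""
    transfer_ms = size_mb / transfer_mb_per_s * 1000
    return seek_ms + average_rotational_latency_ms(rpm) + transfer_ms


# Hypothetical drive: 7200 RPM, 9 ms average seek, 150 MB/s sustained transfer.
latency = average_rotational_latency_ms(7200)
total = access_time_ms(seek_ms=9.0, rpm=7200, size_mb=1.0, transfer_mb_per_s=150.0)
print(f"average rotational latency: {latency:.2f} ms")
print(f"estimated time to read 1 MB: {total:.2f} ms")
```

For small reads the seek and rotational latency dominate the total, which is why drives with similar transfer rates can still feel very different under random access.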