Which Cloud Storage class is best for data accessed on average once or less during a 30- to 90-day period?

The two key components of any data pipeline are data lakes and warehouses. This course highlights use cases for each type of storage and dives into the available data lake and warehouse solutions on Google Cloud in technical detail. It also describes the role of a data engineer, explains the benefits of a successful data pipeline to business operations, and examines why data engineering should be done in a cloud environment. This is the first course of the Data Engineering on Google Cloud series. After completing this course, enroll in the Building Batch Data Pipelines on Google Cloud course.

Cloud Storage is the essential storage service for working with data, especially unstructured data, in the cloud. Let's do a deep dive into why Cloud Storage is a popular choice to serve as a data lake.

Data in Cloud Storage persists beyond the lifetime of VMs or clusters; that is, it is persistent. It is also relatively inexpensive compared with the cost of compute. So, for example, you might find it more advantageous to cache the results of previous computations in Cloud Storage. Or, if you don't need an application running all the time, you might find it helpful to save the state of your application to Cloud Storage and shut down the machine it is running on when you don't need it.

Cloud Storage is an object store, so it stores and retrieves binary objects without regard to what data is contained in them. However, to some extent it also provides file system compatibility and can make objects look and work as if they were files, so you can copy files in and out of it.

Data stored in Cloud Storage will essentially stay there forever; in other words, it is durable. But it is also available instantly: it is strongly consistent. You can share data globally, yet it is encrypted and completely controlled and private, if you want it to be. It is a global service, and you can reach the data from anywhere; in other words, it offers global availability. But the data can also be kept in a single geographic location if you need that. Data is served with moderate latency and high throughput. As a data engineer, you need to understand how Cloud Storage accomplishes these apparently contradictory qualities, and when and how to employ them in your solutions.

A lot of Cloud Storage's amazing properties have to do with the fact that it is an object store; other features are built on top of that base. The two main entities in Cloud Storage are buckets and objects. Buckets are containers for objects, and objects exist inside of buckets, not apart from them. Buckets are identified in a single, globally unique namespace, which means that once a name is given to a bucket, it cannot be used by anyone else unless and until that bucket is deleted and the name is released. Having a global namespace for buckets simplifies locating any particular bucket.

When a bucket is created, it is associated with a particular region or with multiple regions. Choosing a region close to where the data will be processed will reduce latency, and if you are processing the data using cloud services within the region, it will save you on network egress charges.

When an object is stored, Cloud Storage replicates the object. It monitors the replicas, and if one of them is lost or corrupted, it replaces it with a fresh copy. This is how Cloud Storage achieves many nines of durability. For a multi-region bucket, the objects are replicated across regions; for a single-region bucket, the objects are replicated across zones. In either case, when an object is retrieved, it is served from the replica closest to the requester, and that is how low latency is achieved. Multiple requesters can retrieve the object at the same time from different replicas, and that is how high throughput is achieved.

Finally, objects are stored with metadata. Metadata is information about the object. Additional Cloud Storage features use the metadata for purposes such as access control, compression, encryption, and lifecycle management.
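To make the bucket-and-object model concrete, here is a minimal sketch using the google-cloud-storage Python client library. The bucket name, region, and file names are hypothetical; remember that a bucket name must be globally unique.

```python
from google.cloud import storage

client = storage.Client()  # uses your default project and credentials

# Bucket names live in a single global namespace, so this
# hypothetical name must not already be taken by anyone else.
bucket = client.bucket("my-example-data-lake")
bucket.storage_class = "STANDARD"
bucket = client.create_bucket(bucket, location="us-central1")  # single region

# Objects are binary blobs stored inside the bucket.
blob = bucket.blob("raw/2024/sales.csv")
blob.upload_from_filename("sales.csv")  # hypothetical local file

# Every object carries metadata that other features build on.
blob.reload()
print(blob.size, blob.time_created, blob.storage_class)
```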
To illustrate the use of metadata: Cloud Storage knows when an object was stored, and it can be set to delete the object automatically after a period of time. This feature uses the object metadata to determine when to delete the object.

You may have a variety of storage requirements for a multitude of use cases. Cloud Storage offers different classes to cater for these requirements, and these are based on how often data is accessed.

Standard storage is best for data that is frequently accessed (also referred to as hot data) and/or stored for only brief periods of time. When used in a region, co-locating your resources maximizes the performance for data-intensive computations and can reduce network charges. When used in a dual-region, you still get optimized performance when accessing Google Cloud products that are located in one of the associated regions, but you also get the improved availability that comes from storing data in geographically separate locations. When used in a multi-region, Standard storage is appropriate for storing data that is accessed around the world, such as serving website content, streaming videos, executing interactive workloads, or serving data supporting mobile and gaming applications.

Nearline storage is a low-cost, highly durable storage service for storing infrequently accessed data. Nearline storage is a better choice than Standard storage in scenarios where slightly lower availability, a 30-day minimum storage duration, and costs for data access are acceptable trade-offs for lower at-rest storage costs. Nearline storage is ideal for data you plan to read or modify on average once per month or less, and it is appropriate for data backup, long-tail multimedia content, and data archiving.

Coldline storage is a very low-cost, highly durable storage service for storing infrequently accessed data. Coldline storage is a better choice than Standard or Nearline storage in scenarios where slightly lower availability, a 90-day minimum storage duration, and higher costs for data access are acceptable trade-offs for lower at-rest storage costs. Coldline storage is ideal for data you plan to read or modify at most once a quarter.

Archive storage is the lowest-cost, highly durable storage service for data archiving, online backup, and disaster recovery. Archive storage has higher costs for data access and operations, as well as a 365-day minimum storage duration. Archive storage is the best choice for data that you plan to access less than once a year, for example, cold data stored for legal or regulatory reasons, and disaster recovery data.

Cloud Storage is unique in a number of ways: it has a single API, millisecond data access latency, and eleven nines (99.999999999%) of durability across all storage classes. Cloud Storage also offers Object Lifecycle Management, which uses policies to automatically move data to lower-cost storage classes as it is accessed less frequently throughout its life.
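As a sketch of how such lifecycle policies might look in code, here is a snippet using the same hypothetical bucket and the google-cloud-storage Python client; the age thresholds simply mirror the class minimums described above:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-data-lake")  # hypothetical bucket

# Move objects to progressively cheaper classes as they grow cold.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)   # days
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)

bucket.patch()  # apply the updated lifecycle configuration
```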
Cloud Storage uses the bucket name and object name to simulate a file system. This is how it works: the bucket name is the first term in the URI, a forward slash is appended to it, and then it is concatenated with the object name. The object name allows the forward slash as a valid character in the name, so a very long object name with forward slash characters in it looks like a file system path, even though it is a single name. In the example shown, the bucket name is de-class and the object name is de/modules/02/script.sh; the forward slashes are just characters in the name. If this path were in a file system, it would appear as a set of nested directories, beginning with de-class.

Now, for all practical purposes, it works like a file system. But there are some differences. For example, imagine that you wanted to move all the files in the 02 directory to the 03 directory inside the modules directory. In a file system, you would have actual directory structures, and you would simply modify the file system metadata, so that the entire move is atomic. But in an object store simulating a file system, you would have to search through all the objects contained in the bucket for names that had 02 in the right position in the name. Then you would have to edit each object name and rename it using 03. This would produce apparently the same result, moving the files between directories. However, instead of working with a dozen files in a directory, the system had to search over possibly thousands of objects in the bucket to locate the ones with the right names and change each of them. So the performance characteristics are different: it might take longer to move a dozen objects from directory 02 to directory 03, depending on how many other objects are stored in the bucket. And during the move, there will be list inconsistency, with some files in the old directory and some in the new directory.

A best practice is to avoid using sensitive information as part of bucket names, because bucket names are in a global namespace. The data in the bucket can still be kept private if you need it to be.

Cloud Storage can be accessed using a file access method that allows you, for example, to use a copy command from a local file directly to Cloud Storage. Use the gsutil tool to do this. Cloud Storage can also be accessed over the web: the site https://storage.cloud.google.com uses TLS (HTTPS) to transport your data, which protects credentials as well as data in transit.

Cloud Storage has many object management features. For example, you can set a retention policy on all objects in the bucket, requiring that objects be retained for at least 30 days before they can be deleted. You can also use versioning, so that multiple versions of an object are tracked and available if necessary. You might even set up lifecycle management to automatically move objects that haven't been accessed in 30 days to Nearline and after 90 days to Coldline.
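As a closing sketch, here is how those last two object management features, versioning and a retention policy, might be enabled on the same hypothetical bucket with the google-cloud-storage Python client:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-example-data-lake")  # hypothetical bucket

# Versioning: keep prior generations of objects that are
# overwritten or deleted, so they can be recovered if necessary.
bucket.versioning_enabled = True

# Retention policy: objects cannot be deleted or replaced until
# they are at least 30 days old (the period is given in seconds).
bucket.retention_period = 30 * 24 * 60 * 60

bucket.patch()  # apply both settings to the bucket
```

Lifecycle transitions such as the Nearline and Coldline moves mentioned above would be added with the same rule-based approach shown in the earlier lifecycle sketch.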