Chapter 6

Big Idea 3: Data and Information


Summary:

This section centers on how computers can be used to store, secure, and process large amounts of data, making sense of it either to solve problems or to reach new findings.

 

Key Ideas

· Computers can clean, process, and classify data much better and faster than people
· Collaboration is key to better results
· Communicating information visually helps get the message across
· Scalability is key to processing large data sets
· Storage needs led to the creation of compression techniques

 

So much raw data is constantly being collected in every field. Every purchase you make, every item you return, and every website you visit become data that businesses collect. But what does it all mean? Computers enable us to process data and turn it into information for decision making and research. Computers can often identify patterns in data that humans would not be able to detect. As is often the case, there are also trade-offs to consider with large amounts of data.

 

Data versus Information

Data collected from all types of events, including visits, searches, inquiries, orders, returns, temperatures, scores, attendance counts, items planted, lost, or harvested, fish, birds, photos, videos, and audio files, is considered raw data. These are all just values and descriptions until we make sense of them. While humans can usually do an adequate job on small amounts of data, there is no way we could process the vast amounts now collected in many raw data sets. We get tired, distracted, and bored, and then errors occur or opportunities are missed.

 

Cleaning

One area where computers are very helpful is in “cleaning” the data. This includes removing corrupt data, removing or repairing incomplete data, and verifying ranges or dates, among other steps. Removing or flagging invalid data is very useful. Individuals could easily miss errors in the data, which could cause incorrect results in later processing.
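
A minimal Python sketch of such a cleaning pass, assuming hypothetical temperature records with date and value fields:

```python
def clean(records, low=-40.0, high=140.0):
    """Remove incomplete records; flag values outside [low, high] for review."""
    cleaned, flagged = [], []
    for rec in records:
        if rec.get("date") is None or rec.get("value") is None:
            continue                      # remove incomplete data
        if low <= rec["value"] <= high:
            cleaned.append(rec)           # passes the range check
        else:
            flagged.append(rec)           # flag invalid data instead of using it
    return cleaned, flagged

raw = [{"date": "2023-07-01", "value": 98.2},
       {"date": "2023-07-02", "value": None},    # incomplete
       {"date": "2023-07-03", "value": 985.0}]   # out of range
good, suspect = clean(raw)
print(len(good), len(suspect))  # 1 1
```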

 

Filtering

Computers are also able to easily “filter” data. This means different subsets can be identified and extracted to help people make meaning of the data. For example, all temperature values greater than 98.6 °F could be meaningful and need further processing, or perhaps just a count of how many there are in the entire data set would be enough.
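
A one-line Python filter matching that example, over a hypothetical list of readings:

```python
temps = [97.9, 99.1, 98.6, 100.4, 98.2, 101.0]   # made-up readings

fevers = [t for t in temps if t > 98.6]          # the subset itself
count = sum(1 for t in temps if t > 98.6)        # or just how many there are

print(fevers)  # [99.1, 100.4, 101.0]
print(count)   # 3
```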

 

Classifying

Additionally, computers can help make meaning of large data sets by grouping data with common features and values. These groupings, or classifications, are based on criteria provided by the people who need to work with the data. A single criterion or multiple criteria can be used, depending on the reason the data was collected.
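
A Python sketch of such grouping, classifying hypothetical order records by a single region criterion; multiple criteria could be combined into a tuple key:

```python
from collections import defaultdict

orders = [{"region": "east", "amount": 40},
          {"region": "west", "amount": 25},
          {"region": "east", "amount": 60}]

groups = defaultdict(list)
for order in orders:
    groups[order["region"]].append(order)   # group records by the shared value

print({region: len(group) for region, group in groups.items()})  # {'east': 2, 'west': 1}
```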

 

Patterns

Computers are able to identify patterns in data that people either cannot recognize or could never process enough data to see. New discoveries and understandings are often made this way. When new or unexpected patterns emerge, the data has been transformed into information for people to begin interpreting. Computers make processing huge amounts of data possible so people can make sense of it.

 

Collaboration

Collaboration is a technique that is especially useful in analyzing data. Having a group with different backgrounds, specialties, cultures, and perspectives can result in better analysis and use of the data. Someone may ask or notice something that others with similar backgrounds, or someone working alone, would not, leading to a new hypothesis or discovery about what the data represents.

 

Technology now makes collaboration much easier. Remember that the World Wide Web was created to allow scientists across the globe to share documents for collaboration. The tools have improved greatly, allowing people in different time zones and locations to work together easily. Collaboration in a face-to-face environment is always beneficial, but technology gives many people the opportunity to collaborate via tools such as Skype and Google Hangouts. It also lets multiple people work together on a document, such as a Google document or a Prezi presentation, each in their own time zone and on their own schedule.

 

Sharing and Communicating Information

There are many tools available to help communicate the insights identified from the data to others. Graphics in the form of charts, tables, and other designs are useful for presenting data in a visual, summarized format. Remember the phrase “a picture paints a thousand words”? Use it. The human brain is wired to process information visually, so images and other visual tools are effective ways to get a message across and help others understand it. Providing ways for others to interact with the data, such as playing a sound file or video when someone selects an option, is also useful.

 

Large Data Sets (“Big Data”)

“Big Data” does not mean large numbers, but rather vast amounts of data. These data sets have so many records that they are too large to fit into the available memory of our computers, or even of the servers at our own locations. The files need multiple servers to hold and process the data. This has led to the creation of “server farms”: many large computers located in one place for the purpose of processing data. Businesses, universities, governments, and even individuals can contract for these server farms to run their data through programs that do the cleaning and filtering described above, and then process the data looking for patterns, trends, and solutions. Amazon’s Amazon Web Services (AWS) division provides a platform for data storage and processing; Microsoft, Google, and many other companies provide similar services.

 

Software tools such as spreadsheets and databases are used to filter, organize, and search the data. You are not required to know the details of any specific tool. Search tools and filtering systems are needed to help analyze the data and recognize patterns.

 

To process these extremely large data files, new methods had to be created. One, called MapReduce, was created by Google; Hadoop is an open-source implementation of the same processing model. The dataset is distributed over many servers, and each server processes the section of the dataset it holds, with all servers running the same program at the same time. The individual results are then aggregated, or combined, into the final solution.
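
A single-machine Python sketch of the MapReduce idea, counting words in a pair of hypothetical documents; on a real cluster, the map and reduce steps would run on many servers in parallel:

```python
from collections import defaultdict

documents = ["big data big servers", "big results"]

# Map step: each server would emit (word, 1) pairs for its own chunk.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Reduce step: combine all pairs that share a key into a final count.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'big': 3, 'data': 1, 'servers': 1, 'results': 1}
```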

 

Metadata

Metadata is data that describes
data and can help others find data and use it more effectively.  It is not the content of the data, but
includes information such as:

· Date
· Timestamp
· Author / owner
· File size
· File type

 

Metadata also includes “tags” that are used to identify the content. These tags enable web searches to find the data more easily. Multiple tags on a file help people find it with their search criteria.
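
A small Python sketch of reading this kind of metadata (size, date, file type) without looking at the file’s content; the file name is hypothetical and the file is created here so the example runs anywhere:

```python
import datetime
import os

# Create a small file so the example is self-contained.
with open("example.txt", "w") as f:
    f.write("hello")

info = os.stat("example.txt")                     # metadata, not content
print("File size:", info.st_size, "bytes")        # File size: 5 bytes
print("Modified:", datetime.datetime.fromtimestamp(info.st_mtime))
print("File type:", os.path.splitext("example.txt")[1])  # File type: .txt
```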

 

Scalability

Scalability means the ability to increase the capacity of a resource without moving to a completely new solution, while that resource continues to operate at acceptable levels. The increase should be transparent to the users of the resource. For example, when solutions are scalable, processing should not slow down as the amount of data increases. In the case of data, the added resource would be additional servers to store and process the data. Scalability is an important aspect of storing and processing large data sets. These files cannot fit on our computers or even most organizations’ servers, and the tools we can use to process them change as the file size grows.

 

The “cloud” is considered a scalable
resource.  People connect, store, share,
and communicate across the Internet.  As
traffic or demand for resources increases, the cloud service manages the demand
by providing additional resources such as servers. 

 

Networks can also provide scalability.  As more devices are added to the network,
network managers increase access points and other devices to accommodate the
additional network devices and traffic.

 

Note that scalability also includes the
ability to downsize as needed, again without impacting the storage or
processing. 

 

Security

The security of our data involves preventing unauthorized individuals from gaining access to it and preventing those who can view our data from changing it. Strong passwords help block those trying to gain unauthorized access. That is one reason many sites have increased their password requirements; a password now often must (a sketch of such a check appears after the list):

· contain a capital letter
· contain a number
· contain a special character
· be at least a minimum length
· not be the same as, or almost the same as, a previous password
· not be the same as the user id
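
A minimal sketch of such a check in Python. The previous-password and user-id rules below are simple equality tests; real systems apply stricter similarity checks, so treat this only as an outline:

```python
import string

def password_ok(pw, user_id, old_passwords, min_len=8):
    """Return True if pw satisfies the rules listed above."""
    return (len(pw) >= min_len                             # minimum length
            and any(c.isupper() for c in pw)               # a capital letter
            and any(c.isdigit() for c in pw)               # a number
            and any(c in string.punctuation for c in pw)   # a special character
            and pw != user_id                              # not the user id
            and pw not in old_passwords)                   # not a previous password

print(password_ok("Secur3!pass", "jsmith", ["OldPass1!"]))  # True
print(password_ok("password", "jsmith", []))                # False
```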

 

Good security also means providing only “read” access to those who should not change anything. Very few people then have “update” and “delete” access, limiting accidental or deliberate changes.

 

We trust the companies that maintain our personal information, including Social Security numbers and financial information such as credit card numbers, to keep it secure. As the news often reports, many companies have had their security defenses breached and customer data stolen. The data is often sold to those planning to use unsuspecting users’ identities to open accounts and make purchases. Always check your accounts often!

 

Security also involves encrypting data before it is transmitted, so that it remains secure if it is intercepted during transmission. The receiving location decrypts the data so it can then be used as needed.
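
A sketch of this encrypt-then-decrypt round trip using the third-party Python package cryptography (installed separately with pip install cryptography); the message text is made up:

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # both sender and receiver must share this key
cipher = Fernet(key)

token = cipher.encrypt(b"card number: 4111...")  # what travels over the network
print(token[:20], "...")                         # unreadable if intercepted
print(cipher.decrypt(token))                     # b'card number: 4111...'
```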

 

Privacy

Digital footprints and fingerprints are the little pieces of data we leave as a trail as we go through our daily lives. Some of the ways our data is collected include:

· GPS coordinates embedded in photographs and apps, showing our location
· financial transactions, such as viewing, comparing, and making purchases
· websites visited
· cell phones pinging off towers
· key card access to locations

 

Many people willingly provide personal information to sites to gain access or privileges, whether for sports teams, shopping, or restaurants. Their data is stored and may be sold with or without their knowledge.

 

Some sites claim to aggregate data to protect individual privacy. This means summarizing the findings at such a high level that no individual or group should be identifiable. Done incorrectly, however, individual information can be, and has been, identified. It is surprising how one small piece of identifying data, such as a zip code, can be combined with other legally available sources of data on the web to identify a person. All too often, this invasion of privacy is then posted or shared in ways unknown to those impacted.

 

Representing Digital Data

New names for numbers have been created to account for these large amounts of data. These have been identified and agreed upon by the members of the SI (International System of Units). While the American and European number-naming systems differ, all countries agree on the SI prefixes.

· Remember that a bit (“binary digit”) is the smallest unit for computers and is either 0 or 1.
· A byte is made up of 8 bits. It is the basic unit used to describe memory.
· A kilobyte is approximately 1,000 (“kilo”) bytes.
· A megabyte is approximately 1,000,000 bytes (or a thousand kilobytes).

 

Note the powers of 10 and 2 in
the table below.

 

SI Naming Convention    Power in Binary    Power in Decimal    American Naming Convention
Kilo                    2^10               10^3                Thousand
Mega                    2^20               10^6                Million
Giga                    2^30               10^9                Billion
Tera                    2^40               10^12               Trillion
Peta                    2^50               10^15               Quadrillion
Exa                     2^60               10^18               Quintillion
Zetta                   2^70               10^21               Sextillion
Yotta                   2^80               10^24               Septillion
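
A few lines of Python make the gap between the binary and decimal columns concrete:

```python
# SI prefixes name powers of 10; memory sizes are powers of 2,
# which is why a kilobyte is only *approximately* 1,000 bytes.
for name, exp10, exp2 in [("kilo", 3, 10), ("mega", 6, 20), ("giga", 9, 30)]:
    print(f"{name}: 10**{exp10} = {10**exp10:,}  vs  2**{exp2} = {2**exp2:,}")
# kilo: 10**3 = 1,000  vs  2**10 = 1,024
# mega: 10**6 = 1,000,000  vs  2**20 = 1,048,576
# giga: 10**9 = 1,000,000,000  vs  2**30 = 1,073,741,824
```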

 

There are considerations that have to be constantly evaluated when it comes to storing data.

Data Compression: Lossless and Lossy

There are also trade-offs in storing data. Image files become large very quickly, so compression techniques were developed to decrease their size. Raster, or bitmap, images store data by pixel (picture element). An image that is 300 x 400 pixels contains 120,000 pixels; at 3 bytes per pixel, it would take approximately 120,000 * 3 = 360,000 bytes to store one picture. A 10-megapixel image would take 30,000,000 bytes (10,000,000 pixels * 3 bytes), or 240,000,000 bits. Depending on the bandwidth, this image could take 30-60 seconds to download, which is quite slow for one image. Vector graphics have smaller file sizes, but the images are not as lifelike as raster images.
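
A short Python check of that arithmetic; the 8 Mbps bandwidth used for the download estimate is an assumed value for illustration:

```python
# Uncompressed size of a bitmap image at 3 bytes per pixel.
width, height, bytes_per_pixel = 300, 400, 3
print(width * height * bytes_per_pixel)          # 360000 bytes

# A 10-megapixel image, in bytes and bits.
size_bytes = 10_000_000 * bytes_per_pixel        # 30,000,000 bytes
size_bits = size_bytes * 8                       # 240,000,000 bits

# Download time at an assumed bandwidth of 8 megabits per second.
print(size_bits / 8_000_000, "seconds")          # 30.0 seconds
```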

 

We can reduce the amount of space
needed through data compression. 

 

· Lossless techniques allow the original image to be retrieved. No data is lost, but the file cannot be compressed as much as with lossy techniques.

· Lossy compression techniques lose some data in the compression process. The original can never be restored, but the compression is greater than with lossless techniques.

 

JPEG (Joint Photographic Experts Group) images reduce file sizes by up to 90% by replacing similar colors across large parts of the image with a single color. The replacements generally cannot be detected by the human eye.

 

Note that the same concepts apply to music and video files. There are lossy and lossless compression programs for these files as well. As with image files, the original file can be restored with a lossless technique, but not with a lossy one.

 

One commonly used lossless compression technique is Huffman encoding. It assigns a code to each character (numbers and special characters included). Characters that appear more frequently get a shorter code, and those that are less frequent get a longer code. The larger the file, the greater the opportunity for savings on file size.
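
A compact Python sketch of building Huffman codes; the exact codes depend on tie-breaking, but more frequent characters always receive shorter codes:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: frequent characters get short codes."""
    heap = [[freq, [ch, ""]] for ch, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)             # two least frequent subtrees
        hi = heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]          # prepend a bit as codes grow upward
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {ch: code for ch, code in heap[0][1:]}

print(huffman_codes("aaaabbc"))
# {'c': '00', 'b': '01', 'a': '1'}: 'a' appears most often, shortest code
```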

 

The format the data is stored in also plays a role. The file extension (often three characters) determines which software tools can open and process the data. For example, bitmap images are files that are usually quite large.

 

Note, too, that more storage space is needed for files that must be updated, or “written to,” than for files that are simply read. A copy of the file is made for writing, partly to keep a backup in case the file needs to be restored and partly to handle the space needed to modify it. Reading files does not require this extra space.

 

Vocabulary

 

Big Data

 

Classifying data

 

Cleaning data

 

Collaboration

 

Filtering data

 

Lossless data compression

 

Lossy data compression

 

Metadata

 

Patterns in data

 

Pixel

 

Privacy

 

Scalability

 

Security

 

Server farm

 

 

Review Questions

 

1. Why is cleaning data important?
a. To ensure bad data does not hide or skew results
b. Removes bad or incomplete data
c. Repairs bad or incomplete data
d. All of the above

 

2. Why is analyzing big data important?
a. To identify patterns that humans cannot see
b. To increase the viability of server farms
c. To verify existing solutions to problems
d. To test due diligence

 

3. Collaboration can provide:
a. Several points of failure
b. Clean data
c. Duplication of effort
d. Insights we may never get otherwise

 

4. Information about the author of a document is:
a. Metadata
b. Content
c. Context
d. Cleaning

 

5. Being able to add or remove resources to store large datasets is called:
a. Scalability
b. Filtering
c. Efficiency
d. Routing

 

6. Providing someone read-only access to data is an example of:
a. Security
b. Privacy
c. Encryption
d. Ciphering

 

7. Which data compression technique provides the most compression?
a. Lossy
b. Lossless
c. Filtering
d. Classification

 

 

Answers to Review Questions

 

1. d
Data needs to be cleaned to remove or repair corrupt or incomplete data to ensure valid data is used for research and analysis.

 

2. a
Analyzing big data allows us to identify patterns that could help solve problems or reveal new possibilities, at a scale that people could not process on their own.

 

3. d
Collaboration can provide insights we may never get otherwise, by having more than one person with different backgrounds and perspectives create, design, and evaluate data, documents, products, etc.

 

4. a
Information about the author of a document is data about the data, which is metadata.

 

5. a
Scalability is adding or removing resources to store and process large datasets.

 

6. a
Security is providing the appropriate level of access to data or software functionality.

 

7. a
Lossy data compression provides the most compression.