May 28, 2014

The Efficient Analyst (part one): Stop moving your data around!

Many organizations collaborate by moving data and information between people, files or applications.  While movement may provide analysts with the illusion of adding value, data movement is not the same thing as data sharing. While the latter is a good thing to be nurtured, the former causes version control problems, wasted time, and analyst frustration.

In our original post on why data analysts are so inefficient, we covered the six core causes of data waste. In today's post we'll tackle the first cause: conveyance.  Below are five ways to reduce data movement and put the focus back on sharing.

1. Stop emailing (IMing, Skyping) files back and forth. Email and chat programs are some of the most popular ways to move data around. While we are all guilty of doing this every now and then, it creates huge access and version control problems for your team. Stop doing it. The most effective alternative is to utilize your company's sharing environment to store and update files. Don't have a sharing environment? Go to #4.

2. Eliminate "staging" files or sheets. A good general rule is: do not put data into a spreadsheet, database or file that does not add specific value. Many use staging files when restructuring or reformatting data before it is used by someone or something else. Cutting these types of files not only eliminates waste and confusion, but also ensures faster assembly of process documentation.

3. Be careful with "movement automation". This category can include FTP syncs, VBA macros or any other kind of automated process that pulls data from one location into another. FTP syncs often create timing bottlenecks (to be discussed in a future post) or create confusion among users. VBA macro code in Excel can cause chaos when there are structural changes in either the source or destination spreadsheets. In general, any kind of data automation must be handled with a high level of care and used only when the effort of executing the task manually outweighs the risk of the automation breaking.

4. Create a sharing environment. There are many products for sharing data across organizations. Some upload files to shared drives or use SharePoint or Dropbox, while others will build custom SQL databases to store raw data. Some use Shooju. Whatever option you choose, ensuring that it is a complete solution is critical to creating any form of internal data harmony. Because of the confusion they can cause without full adoption, partial solutions can be worse than no solution at all.

5. Google it. Some organizations want nothing to do with the cloud, while others are rapidly making the transition. Wherever your company is positioned, Google Drive can be an excellent solution to specific sharing problems. While seasoned data analysts may scoff at this idea, there are great ways to track changes and store raw data sets that streamline shared data flow.

A word about sharing across applications

The suggestions in this post are most effective when moving data between individuals using the same application - i.e. moving an Excel file to another Excel user. Sharing data across applications is much more difficult. In many organizations, analysts who use more advanced analytical tools like SAS often have to export results into Excel spreadsheets (creating a "staging problem") and email them (gasp!) to colleagues who are non-SAS users. Additionally, keeping these staging files updated can be a big pain.

Readers: what tips do you have to solve data sharing across applications? Leave them in the comments below.

May 27, 2014

Open government and Shooju

This Friday and Saturday, Shooju will be attending Transparency Camp at George Mason University. For this event we have been working with a team looking to shed light on government spending and procurement practices through publicly available data. Check out a visualization tool that shows government spending by congressional district.

May 23, 2014

Microsoft Excel tip #1: reduce errors with conditional formatting

Here at Shooju we know firsthand how stressful it is to find spreadsheet problems the night before a big project deliverable. Or worse: not finding those mistakes until it's too late. In our inaugural Excel tip, we'll provide examples of how to use conditional formatting to find and eliminate pesky Excel errors - saving time, money, embarrassment and, most importantly, your sanity.

What is conditional formatting?

Conditional formatting allows you to format cells based on a set of specific parameters or conditions. For example: if the value in this cell is greater than 5, make it blue.

You can access conditional formatting on the "Home" tab in Excel. Once opened, selecting "new rule" will reveal a list of options that include "Format all cells based on their values" or "Format only cells that contain". Try them out and see how they work. Some examples are provided in this video.

Error reduction techniques

There are three ways to use conditional formatting to find errors: find binary (yes or no) errors, find duplicate values, or find outliers.

Find binary errors

Here, a rule was created to highlight any cells containing calculation errors bright red:

Examples of other binary errors include:
  • letters (strings) where there should be numbers (integers)
  • values that should be true but are false (or vice versa)
  • cells over a max value (or under a min value)

Find duplicate values

Below, a rule was created to highlight any value that appears more than once - also in red:

Find outliers

Establishing a color range allows you to detect values that fall well outside the range of the others. In the example below, the value highlighted in red is two orders of magnitude higher than the others in the sheet.
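For readers who work outside Excel, the same three checks can be sketched in a few lines of plain Python. The thresholds below are illustrative, not universal:

```python
def flag_problems(values, outlier_factor=100):
    """Mimic the three conditional-formatting checks in code: binary
    errors (wrong type), duplicate values, and outliers."""
    flags = {"binary": [], "duplicate": [], "outlier": []}
    numeric = sorted(v for v in values if isinstance(v, (int, float)))
    typical = numeric[len(numeric) // 2] if numeric else 0  # rough median
    seen = set()
    for i, v in enumerate(values):
        if not isinstance(v, (int, float)):
            flags["binary"].append(i)      # letters where there should be numbers
        elif typical and abs(v) > outlier_factor * abs(typical):
            flags["outlier"].append(i)     # orders of magnitude beyond the rest
        if v in seen:
            flags["duplicate"].append(i)   # value appears more than once
        seen.add(v)
    return flags
```

Conditional formatting does the same work visually, of course - the point is that each rule is just a simple predicate applied cell by cell.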

Get more on conditional formatting from Microsoft, Chandoo and GCFLearn.

May 22, 2014

Three ways to visualize your data

Everyone loves data visualization. By converting data into art, analysts and researchers are able to translate complexity into storytelling. There are countless ways to visualize data, from making simple charts in MS Excel to custom-designed JavaScript applications. In this post we'll talk about the three core ways to visualize data, when to use each method, and some of the associated challenges with each.

In application

The most straightforward way to visualize data is in your analysis application. Many analytical tools like MS Excel, Python, R, STATA, etc. provide access to visualization elements that range in power, flexibility and ease of use. In general, using these types of applications is the fastest and easiest way to create visualizations, so if you are on a deadline or producing a client deliverable, this is usually the best way to go.

MS Excel

Of course, there are many limitations to keeping your visualizations inside of your analytical tools. While "power users" can do some impressive stuff, you still must work within the framework provided by the application. Many tools are getting better at data visualization, with greater options and increased flexibility, but it's still difficult (though not impossible) to produce a visualization that "wows" your customers or clients. If you are using "big data", the visualization libraries inside tools like Python or STATA can help you, but they require a strong background in the application and/or some programming skills to get great visual results. Finally, if you are looking to make your chart or visualization interactive or web-based, then you are generally out of luck.

When to develop visualizations "in application"
  • you are on a tight deadline
  • your deliverables are static and require no interactivity (web)
  • you are operating with "small data"
  • you have experience in your chosen tool

Business intelligence tools

There are currently dozens of business intelligence (BI) tools that can be used to create sophisticated visualizations. Software applications like Tableau, Spotfire and QlikView offer a robust set of visualization templates including mapping (GIS) and charts/graphs not available natively in standard analytical applications. Most BI tools offer free trial periods and robust tutorials, so depending on your timetable, the most effective approach is typically to select a couple and kick the tires. We also recommend checking out the Gartner Magic Quadrant as one of your first steps:

A few warnings when using business intelligence tools and platforms. The increased user-friendliness of BI tools comes at the expense of flexibility on both the front and back ends. Most BI tools require your data to be structured in a very specific way in your spreadsheets or database files, which can consume the majority of your project setup time. On the front end, prefabricated templates and structures can limit creativity with your visuals. While these templates cover the needs of many users, those looking for more cutting-edge or original designs may be disappointed. Finally, while many BI tools advertise themselves as fit for web applications or customer-facing products, their lack of flexibility and limits on user interactivity make them better suited to internal decision making.

When to use BI for visualization
  • you are using "big data" sets
  • you require mapping and more sophisticated visualizations
  • you have budget for third party application
  • you don't have in house front-end designer(s) or developer(s) at your disposal
  • you want your visualizations to be dynamic
  • your focus is internal rather than external

Open source libraries

Open source charting libraries have become some of the most popular options for visualizing data on the web. While they are not as flexible as creating visuals from scratch, open source libraries give developers the backbone to create fun, creative and interactive visualizations that can impress any beholder. These types of libraries are the optimal choice for client-facing products and anything that is web-driven. Highcharts is great for basic visuals, and can provide solutions to charting problems ranging from the simple to the complex. D3, on the other hand, is more avant-garde and community-driven, so it is often a breeding ground for visual creativity. At minimum it's a great place to go for inspiration (like the one below from The Guardian or this terrific one from The New York Times).
Courtesy of The Guardian.

The primary challenge with using these types of tools is that you need some knowledge of JavaScript - or a professional developer who has it. Additionally, you need to have your data hosted someplace (Shooju works great!) to pull it into your visualizations. It's also not optimal for short timelines, as development time can vary widely depending on the complexity of the visualization and the data behind it.

When to use open source libraries for visualization

  • your data visualization is going on the web
  • you have javascript knowledge or an in-house/contract developer
  • you have a longer timeline
  • you have a data platform or data hosting environment
  • you are looking to be creative or "wow" your customers

Conclusion: do your homework!

This post is meant simply as a starting point on different ways to visualize your data. Before you begin your visualization project, we suggest you do your homework to get a full understanding of what products and services are out there, so you can maximize the value of the information you are trying to visualize. Best of luck!

May 16, 2014

How to compete using small data

Source: The New Yorker

The business world continues to use data to gain a competitive advantage. While "big data" has reached a level of media prominence, small data analysis continues to comprise the silent majority of business intelligence. Organizations of all shapes and sizes use small data to develop and maintain competitive advantage in a wide variety of industries. In this post we'll define small data, discuss its associated challenges, and map out its place in the data universe.

What is small data?

Small data is, in many ways, an inverted approach to big data. Rather than having enormous amounts of data from a few key sources, small data means using a lesser amount of data from a more diverse set of sources.

Despite its low profile, the vast majority of organizations that analyze data use this type of approach. And many of those most successful with small data pull their information from publicly available or non-proprietary sources. Some examples of commonly used small data include:

  • Historical pricing for goods and services
  • Polling data (raw or cross-tabs)
  • Econometric data
  • Historical weather patterns
  • Financial reporting information

While none of these types of data are terribly exciting, they are the backbone of the analytical frameworks of most groups and organizations looking to synthesize information into actionable decision-making.

Big or small, velocity is always important

Much has been made of the Three V's of Big Data: Volume, Variety and Velocity. At the risk of stating the obvious, the timeliness of data is critical to achieving competitive advantage whether you are using one data point or a billion. Often, small data is pre-processed before you receive it, and as such it can never compete with the currency of data that comes directly from the source. The relative size of small data, however, can reduce the amount of time it takes to analyze and extract actionable insight from the compiled information.

Channel of success

The proverbial "sweet spot" of competitive advantage in data comprises a very tight corridor between creating original insight and avoiding analysis paralysis. As shown on our original 2x2 matrix, not having enough data points or data sources will surely leave you in the dust, while having too many of both can make it virtually impossible to achieve any kind of insight in a timely fashion. Every industry is different and there are exceptions to every rule (Google), but a general guideline looks something like this:

Small data in action: Nate Silver in 2012

A very public example of small data success is Nate Silver's accurate prediction of the 2012 presidential election. Starting before the 2008 election, he realized that polls are inherently inaccurate because of their built-in biases and statistical margins of error. So, rather than creating his own poll, he simply aggregated the results of many different polls and adjusted for historical accuracy. While this was methodologically sophisticated, it did not require enormous computing power, proprietary software or a room full of servers; it was done using a series of spreadsheets on a laptop computer.
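To illustrate the idea (and only the idea - the actual FiveThirtyEight model is far more sophisticated), a poll aggregation weighted by pollster reliability can be sketched in a few lines of Python. The numbers here are made up:

```python
def weighted_poll_average(polls):
    """Aggregate many imperfect polls into one estimate by weighting
    each poll's result by a reliability score - e.g. one derived from
    how accurate that pollster has been historically."""
    total_weight = sum(weight for _, weight in polls)
    if total_weight == 0:
        raise ValueError("at least one poll must have positive weight")
    return sum(result * weight for result, weight in polls) / total_weight

# Each tuple: (candidate share reported by that poll, pollster reliability weight)
polls = [(52.0, 0.9), (48.0, 0.5), (51.0, 0.8)]
estimate = weighted_poll_average(polls)
```

The insight is in the weights, not the arithmetic: any single poll is noisy, but a reliability-weighted blend of many polls cancels much of the noise.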

For Nate, insight came not from the volume of data but from its diversity. By recognizing that no data source is perfect, and that even publicly available data can be melded and rearranged to offer competitive insight, he established himself as one of the foremost authorities on data and analysis.

The hardest part of small data

Despite its value, the hardest part of using a lot of different data sources is getting them in one place where they can be used. With data diversity comes a wide variety of formats, structures, interfaces, storage mechanisms and errors. And even once these sources have been pulled together for the first time, the critical task of keeping the data constantly updated is often enormously time consuming. Even if data consolidation is streamlined (perhaps using an ETL process), building the infrastructure for storing and accessing this data across organizations can take years at huge cost.
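As a toy illustration of that consolidation problem, here's a Python sketch that normalizes two hypothetical source formats into one common series. The formats and field names are stand-ins for whatever your real sources use, and real sources invariably need far more massaging:

```python
import csv
import io
import json

def normalize(source):
    """Reduce one source to a common list of (date, value) records.
    Each source arrives as a (format, payload) pair."""
    kind, payload = source
    if kind == "csv":
        reader = csv.DictReader(io.StringIO(payload))
        return [(row["date"], float(row["value"])) for row in reader]
    if kind == "json":
        return [(item["d"], float(item["v"])) for item in json.loads(payload)]
    raise ValueError(f"unknown source format: {kind}")

def consolidate(sources):
    """Merge every source into one sorted series - the 'one place'
    that diverse small data needs to live."""
    records = []
    for source in sources:
        records.extend(normalize(source))
    return sorted(records)
```

Even in this tiny sketch, every new format means another branch in `normalize` - which is exactly why keeping many diverse sources consolidated and current is the expensive part.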

We at Shooju see this as the number one challenge for organizations looking to gain actionable insight from small data. To learn more about how we solve this problem, and some insight into the best technologies for the job, check out our architecture.

It's how you use it

As the saying goes, it's not the size that counts, it's how you use it. This framework lays out one ingredient of analytical success with small data; it doesn't replace the value of true insight. Our belief is that effective and practical usage of small data allows groups and organizations to spend more time extracting value from the data rather than managing and collecting it.

May 14, 2014

Why are data analysts so inefficient?

Image by Falkor, Krypt3ia

In an attempt to produce better products, explore new markets and defeat competition, companies spend huge amounts of time and money to create data processes that empower researchers, analysts and data scientists. While these processes take many different forms, most companies still aren't using data efficiently, and much of the time and money they spend on data is wasted. 

Taiichi Ohno, who is widely considered the father of lean manufacturing, defined what he believed to be the seven primary sources of waste in manufacturing. From five of these factors (conveyance, inventory, waiting, over-processing and correction) as well as one of our own (opacity), our list of "The Six Biggest Data Time Wasters" was born. In this series, we'll examine these time wasters and offer solutions on how to eliminate them.


Conveyance

Analysts and IT professionals waste time moving data back and forth between models, users and physical locations. Some examples include moving databases between users, linking spreadsheets, copying and pasting data from one location to another, or syncing FTP sites to access data in multiple countries.

While many organizations try to address these needs by placing spreadsheets on mirrored shared drives or creating SQL databases, this often makes the data conveyance process both inflexible and unstable. One of our customers originally set up a series of five interconnected spreadsheet models that had a total process failure after one analyst moved his output data over just one column.

Here is a blog post on recommendations on how to reduce conveyance waste.


Inventory

Collecting and holding unused data in inventory costs money. While the direct cost of storing data continues to fall, the human investment in updating and maintaining unused data, as well as the opportunity cost of not using that data, both continue to rise. Also, storing data that is unused or unusable can muddle the entire data ecosystem, making it harder to find the data you actually need.

The magnitude of inventory costs can vary, with examples ranging from the cost of preserving large, legacy databases all the way down to updating a small data set with superfluous data points. And while groups of all sizes face this issue, this problem is much more pronounced and costly within large organizations. One large ($10 billion +) energy company we talked to was spending millions a year to maintain and collect unused data whose value had not been determined.

Here is a blog post on recommendations on how to reduce inventory waste.


Waiting

"Waiting" can be classified into two core categories - human and technological. On the human side, organizations with "data bureaucracies" can inadvertently create work-process bottlenecks that waste enormous time and frustrate data users. Many of our customers have historically assigned analysts (or in some cases interns) to manage specific data sets, only to find that when that person is busy, on vacation or has left the company, the data is not updated.

When it comes to technology, many organizations have updating or ETL processes that refresh too infrequently.  Because of the clunkiness of this type of setup, data users can be forced to wait for the data to update, which might not happen for another several hours or even days. 

Here is a blog post on recommendations on how to reduce waiting time.


Over-processing

In the spirit of Rube Goldberg, data often go through too many processing steps on the way to being useful. While these steps are intended to save time, increase consistency or standardize inputs and outputs, many companies go too far.

We have seen countless companies create structures that move data from one spreadsheet or database to another (conveyance) without adding any value. These steps can drive huge inefficiencies by inflating storage requirements and documentation time, and by demanding an unnecessary amount of human intervention.

Here is a blog post on recommendations on how to reduce over-processing.


Correction

Speaking of over-processing: the more complex the data process, the more error-prone the results become. While this is true of any process, the complexity of data systems combined with tight project deadlines can create the ideal conditions for a mistake.

Even though these mistakes are both common and costly, quality checks are too often missing from data processes. Before working with us, one of our clients was forced to redact an entire presentation built on a forecast with an out-of-date set of assumptions, embarrassing the team and the company while costing them future work.


Opacity

Any data process that lacks documentation and transparency can end in disaster for an organization. The trigger for these catastrophes can be changes in personnel (the point person leaves), infrequent updates (the point person forgets how to do it) or simply changes in requirements (the point person needs to change his or her process).

Having seen countless companies pay the price for opacity, we believe that creating a culture that supports a transparent set of data process documentation is critical to maintaining the implicit value and going concern of your business unit or organization. This issue is discussed in our Top Five Business Modeling Pitfalls blog post as well.

While no single framework can fully capture all the challenges associated with data efficiency, understanding these six factors will help organizations develop more productive team members and higher quality results. Over the coming weeks, we will publish a post on each of these six factors, with detailed examples and suggestions on how to reduce their cost and impact.