Jul 7, 2014

The Efficient Analyst (part four): The over-processing epidemic

When we come across companies that need help with complex data workflows, the most common problem we see is over-processing. What do we mean by over-processing? Data passes through superfluous steps that add no value for the end user.

Over-processing can take many different forms:
  • Intermediate files or sheets are created to "stage" data. Maybe an output from one model is rearranged in a separate spreadsheet before it is loaded into another model.
  • Multiple human beings need to touch or analyze the data, and each person has their own structure or way of doing things.
  • Legacy spreadsheets are complicated and intertwined, with a deep-rooted fear of "rocking the boat".
Why is over-processing a problem? Every extra node in a work process makes things take longer, makes results harder to audit, and dramatically increases the risk of error.

Like many workflow problems, over-processing occurs most frequently during periods of tight deadlines or heavy workloads. "Band-Aid" solutions become much more common, and hitting the deadline becomes more important than how it is reached. Those shortcuts then become ingrained in the process and carry over into future iterations.

Streamline your workflow

After these heavy work periods are over, there is enormous value in evaluating workflows that are likely to be used again. By streamlining them into efficient, easy-to-understand processes, you can hugely raise the long-term value of the work. Here are a few steps to follow to reduce instances of over-processing.

1. Evaluate the process. Take time to chart out each node in a particular process so that it's clear how it works in the current state. Share this workflow document with anyone who is responsible for touching the data on its way to the end-user. Ensure that each step is laid out. Here is a breakdown of a hypothetical workflow:

2. List out the demands of the end-user or customer. What is the purpose of your process? To identify which steps are unnecessary, your team must have a full breakdown of how each step adds value to the final product.

3. Determine which steps can be consolidated, automated or eliminated.  Revise your workflow so that it looks something like the chart below. It's okay to dream big. Indicate processes that you want to automate before you know how - then seek a solution that can work.

4. Implement and iterate. Test your changes. Revise as needed.
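For teams that want to make the evaluation in step 1 concrete, the audit can even be sketched in a few lines of code. This is a minimal illustration; the step names and value flags are hypothetical, not from any real client workflow:

```python
# Model each node in a hypothetical workflow and whether the end user
# sees its value. Step names and flags are illustrative only.
workflow = [
    {"step": "export model A output",        "adds_value": True},
    {"step": "restage output in temp sheet", "adds_value": False},
    {"step": "reformat columns for model B", "adds_value": False},
    {"step": "load into model B",            "adds_value": True},
    {"step": "produce final report",         "adds_value": True},
]

# Steps that add no end-user value are the candidates for
# consolidation, automation or elimination in step 3.
candidates = [node["step"] for node in workflow if not node["adds_value"]]
print(candidates)
```

Even a toy model like this forces the team to state, step by step, who consumes each output - which is usually where the superfluous nodes reveal themselves.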

Jun 27, 2014

Microsoft Excel tip #3: Flash Fill

Every once in a while, Excel can surprise even us hardened pros. We discovered "Flash Fill" today (totally by accident) during some client work, and it made our afternoon. Bear in mind we are using Excel 2013, so this might be a new feature. Check out the video below.
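For readers without Excel 2013: Flash Fill infers a transformation from an example you type - say, pulling first names out of a column of full names - and applies it down the column. Outside Excel, the same one-off transformation is easy to script; here is a minimal Python sketch (the names are illustrative):

```python
# A typical Flash Fill use case, scripted: extract first names
# from a column of "First Last" strings.
full_names = ["Ada Lovelace", "Grace Hopper", "Alan Turing"]
first_names = [name.split()[0] for name in full_names]
print(first_names)  # ['Ada', 'Grace', 'Alan']
```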

Have a great weekend!

Flash Fill in Excel from Shooju on Vimeo.

Jun 12, 2014

The Efficient Analyst (part three): No more waiting

Nothing kills analyst productivity like the waiting game. In a world with big demands and tight deadlines, hitting roadblocks to progress can result in huge frustrations and last-minute scrambling. Where does this waiting time come from? Quite simply, analysts must wait for data if they can't get it on their own when they need it. In part three of this series, we'll review the causes of analyst wait time and offer suggestions on how to fix it.

The drivers of wait time

Analyst wait time comes from two sources: humans and technology. On the human side, data bureaucracies or process bottlenecks can make analysts dependent on the availability of other data-capable colleagues, which can burden specific individuals and create internal strife. On the technology side, bottlenecks are caused by a lack of accessibility, such as permission restrictions, hard-to-use interfaces, or complex ETL processes that do not run frequently enough to meet the needs of data users.

The human side

Process bottlenecks arise out of labor specialization. Perhaps there is a member of your team (let's call him or her "Data Person") who has been entrusted with managing all or some of your organization's data. Data Person might have the best data skills, or the volume of data might seem to require a point person to maintain it.  

It's important to note that Data Person is not JUST a data geek, IT professional or programmer. They are THE person (or persons) on whom the organization depends to get, use, filter or generally utilize data. There are some obvious problems with this. First, if Data Person is sick, on vacation or leaves the company, analysts don't have access to the data they need (or will waste time trying to figure it out on their own). Second, if Data Person is swamped by requests and new data updates, bottlenecks can cause analyst work to be delayed or create stress among the team (including and especially Data Person).

Good data persons try to document processes so that their absence or busyness does not slow things down. While documentation is nice, it does not solve the greater issue: the skill gap between Data Person and the average data user. 

The technology side

Challenges arise when technology creates a barrier to data access. This can be by design (access is restricted on purpose) or due to structural complexity (getting the data is simply too hard for the average data user). Either way, lack of access almost guarantees data bottlenecks. Even in cases where there are no permission restrictions on data access, hard-to-use data structures still mean that Data Person must query or extract the data for the user.

Reduce waiting time by closing the gap

Closing the gap that causes wait time must be approached from both angles. Technology must be used to create a system where data is easy to access and meets ever-changing analyst needs. This can be done by creating simpler interfaces and less complex data structures, or by empowering analysts to execute ETL jobs themselves so they have real-time control over the data they need.

On the other side, data users must be well trained and comfortable working with data so that they are empowered to get what they need when they need it. Data training usually requires substantial effort to enhance the capabilities of the entire team, and it can take months to get everyone up to speed. This is not a quick fix - it's an investment in the future productivity of your organization.

But what about Data Person?

We are not advocating for the elimination of the Data Person role, but rather a re-allocation of that person's skills. Rather than positioning him or her as reactive to the individual data needs of the organization, Data Person must be proactive in implementing easy-to-use technical solutions while training analysts to use them. Oh, and doing all of this while also doing their regular job during the transition. Easy? Not so much. But give us a shout - we might be able to help.

Data persons or data users - do you have any thoughts on this? What's wrong with our approach? Are we being realistic about the capabilities of organizations to close "the waiting gap"?

Jun 6, 2014

Transparency Camp 2014: But how do I GET the data?

On Friday, May 30th, I was lucky enough to give a talk at Transparency Camp 2014 on the challenges of data collection. I've made the slides available below. Many thanks to all who attended!


Jun 3, 2014

The Efficient Analyst (part two): Store only the data you need

The Indiana Jones Warehouse

The more stuff you own, the more time you spend managing it. 

My mother's insight referred to clothing and home furnishings, but it is hugely applicable to the world of data storage and analysis.

Whenever we start a data project with a client, we are always shocked by how much data they store that is simply not used - and not even data that they generate themselves. One of our clients had an SQL database with 1,800 economic indicators stored for 210 countries going back several decades. After talking with analysts directly, we discovered three HUGE problems with this approach:

The first problem was that most of the data pulls included only two core indicators: GDP and population. The second was that most of the data pulls were for six unique countries (USA, Japan and the BRICs). And the final problem was that various parts of the data were updated monthly, which meant that at least one member of the analyst team was responsible for refreshing this data twelve times a year, taking him two to three full days to complete each update.

So, let's review:

The database in question had 1,800 indicators × 210 countries = 378,000 unique data series.

The most commonly used ones? Only twelve.

The company was spending 200-300 man-hours a year to update one data set, of which 99.997% was almost never being used. That figure doesn't even include the cost of direct maintenance, storage requirements or troubleshooting.
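The arithmetic is easy to verify:

```python
indicators = 1800
countries = 210
series_total = indicators * countries    # 378,000 unique series
series_used = 12                         # GDP + population for six countries

# Share of the database that was almost never touched.
unused_share = 100 * (1 - series_used / series_total)
print(series_total, round(unused_share, 3))  # 378000 99.997
```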

But why?

When asked why they were holding on to this much unused data, the client gave the response that all hoarders give:

"Because we might need it."

Obviously this is not a good reason to store gigabytes of superfluous data. A successful database structure must be fluid enough that it can adapt to changing data requirements. Otherwise not only will you waste time storing and maintaining it, but your database will collapse under its own weight when changes arise.
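One way to get that fluidity is a document-style store. As a minimal sketch, plain Python dicts can stand in for MongoDB-style documents; the series names, fields and values below are illustrative. Notice that adding a field to one series requires no schema migration:

```python
# Two series in a document-style store need not share a schema.
gdp_usa = {
    "series_id": "GDP_USA",
    "country": "USA",
    "indicator": "GDP",
    "points": {"2012": 16197.0, "2013": 16691.5},
}
brent = {
    "series_id": "BRENT_SPOT",
    "indicator": "Brent spot price",
    "unit": "USD/bbl",               # extra field: no ALTER TABLE needed
    "points": {"2014-06-02": 108.8},
}

store = {doc["series_id"]: doc for doc in (gdp_usa, brent)}
print(sorted(store))  # ['BRENT_SPOT', 'GDP_USA']
```

When requirements change - a new unit, a new frequency, a new source tag - only the affected documents change, not the whole table.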

Cure inventory waste

Here are four ways to prevent huge amounts of inventory waste in analytical databases.

1. Flexibility is everything. Flexibility allows you to store only the data you need, adding and subtracting data as your needs change. There are two ways to practice flexibility. The first is the manner in which the data is structured and stored: SQL databases can be too rigid for some applications, so consider NoSQL options like MongoDB or CouchDB, which can be better suited to storing price strips or economic indicators. Second, inserting and retrieving data must be easy. Simply put, if it is hard to put new data in or pull data out, then adjusting to new data requirements becomes nearly impossible.

2. Think about the use cases first. This applies to reducing many different kinds of data waste, but database architects and analysts must come together on needs and requirements before engaging in a project. Conversations between the two must be forward looking, thinking thoroughly about how the database will adapt to rapidly changing requirements.

3. Keep the data updated. This seems obvious, but data that is out of date helps no one. Keep it fresh through manual updating or an automated ETL process, and it will be used regularly. Otherwise, analysts will find different solutions (like going directly to the source) which will drive down the value of your database enormously. 

4. Use analytics to track usage. IT must understand what data their analysts are using, how frequently and for what purpose. By using services like Splunk or Kibana, or the analytics in SQL Server, you can detect which data is being used and by whom, keeping your database lean and useful.
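Even without Splunk or Kibana, a simple query log plus a counter goes a long way. Here is a minimal sketch, assuming a hypothetical log of (user, series) pairs:

```python
from collections import Counter

# Hypothetical access log: (user, series) pairs pulled from query logs.
query_log = [
    ("alice", "GDP_USA"), ("bob", "GDP_USA"), ("alice", "POP_JPN"),
    ("carol", "GDP_USA"), ("bob", "POP_JPN"),
]

# Count pulls per series; anything that never appears is a candidate
# for archiving or removal.
usage = Counter(series for _, series in query_log)
print(usage.most_common(2))  # [('GDP_USA', 3), ('POP_JPN', 2)]
```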

Readers: What did we miss? How do you and your team make sure that the data you maintain is being used effectively?

May 28, 2014

The Efficient Analyst (part one): Stop moving your data around!

Many organizations collaborate by moving data and information between people, files or applications. Movement may give analysts the illusion of adding value, but data movement is not the same thing as data sharing. The latter is a good thing to be nurtured; the former causes version control problems, wasted time, and analyst frustration.

In our original post on why data analysts are so inefficient, we covered the six core causes of data waste. In today's post we'll tackle the first cause: conveyance.  Below are five ways to reduce data movement and put the focus back on sharing.

1. Stop emailing (IMing, Skyping) files back and forth. Email and chat programs are some of the most popular ways to move data around. While we are all guilty of doing this every now and then, it creates huge access and version control problems for your team. Stop doing it. The most effective alternative is to utilize your company's sharing environment to store and update files. Don't have a sharing environment? Go to #4.

2. Eliminate "staging" files or sheets. A good general rule: do not put data into a spreadsheet, database or file unless doing so adds specific value. Many teams use staging files to restructure or reformat data before it is used by someone or something else. Eliminating these files not only reduces waste and confusion, but also makes process documentation faster to assemble.

3. Be careful with "movement automation". This category includes FTP syncs, VBA macros or any other automated process that pulls data from one location into another. FTP syncs often create timing bottlenecks (to be discussed in a future post) or confusion among users. VBA macro code in Excel can cause chaos when there are structural changes in either the source or destination spreadsheets. In general, any kind of data automation must be handled with great care and used only when the manual effort it saves outweighs the risk it introduces.

4. Create a sharing environment. There are many products for sharing data across organizations. Some upload files to share drives or use Sharepoint or Dropbox, while others will build custom SQL databases to store raw data. Some use Shooju. Whatever option you choose, ensuring that it is a complete solution is critical to creating any form of internal data harmony. Because of the confusion they can cause without full adoption, partial solutions can be worse than no solution at all.

5. Google it. Some organizations have no love for the cloud, while others are rapidly making the transition. Wherever your company is positioned, Google Drive can be an excellent solution to specific sharing problems. While seasoned data analysts may scoff at this idea, there are great ways to track changes and store raw data sets that streamline shared data flow.

A word about sharing across applications

The suggestions in this post are most effective when moving data between individuals using the same application - i.e. moving an Excel file to another Excel user. Sharing data across applications is much more difficult. In many organizations, analysts who use more advanced analytical tools like SAS often have to export results into Excel spreadsheets (creating a "staging" problem) and email them (gasp!) to colleagues who are non-SAS users. Additionally, keeping these staging files updated can be a big pain.

Readers: what tips do you have to solve data sharing across applications? Leave them in the comments below.

May 27, 2014

Open government and Shooju

This Friday and Saturday, Shooju will be attending Transparency Camp at George Mason University. For this event, we have been working with a team looking to shed light on government spending and procurement practices through publicly available data. Check out a visualization tool that shows government spending by congressional district.