How To Find and Hire Data Scientists

So you’re building a data science team. That’s great news! For a business leader, finding qualified data scientists is a critical step toward harnessing big data and machine learning technologies, and a genuine competitive advantage. But the process is fraught with difficulty and pitfalls. We reached out to data science leaders to get their thoughts on the matter.

One of the most important steps to building a successful data science team is hiring a senior data scientist who can lead the further development of the data science team, says Seth Dobrin, who heads up IBM's Data Science Elite Team.

“Until you get a credible senior person in your organization that’s a data scientist, it’s hard to get others to come on board,” says Dobrin, a PhD with more than 20 years of experience in data science fields. “There are some clients that just can’t find talent.”

Dobrin was hired by IBM two years ago to build out the Data Science Elite Team, which is a new endeavor where IBM data scientists engage with organizations in six to 12 week engagements to collaborate on data science and AI projects. The service is free to customers, although there are some requirements (like your willingness to serve as a public reference).

After travelling the world to meet with IBM clients for a year, Dobrin set out to assemble the team, which currently consists of 60 data scientists, machine learning experts, and others with related expertise. Dobrin is currently looking to hire 30 more data scientists this year – which means he might be competing with you.

Respecting Elders

In Dobrin’s view, hiring a senior data scientist signals to other prospective data scientists that the company is serious about data science and AI, and isn’t just jumping on the big data bandwagon. The newly hired leaders will also be able to exercise their own professional networks to fill out the data science team, much as Dobrin has done himself.

Hiring a senior data scientist is a good way to get the ball rolling

“The hardest part is getting that first person in who has that deep network, who can bring in additional talent,” he says. “We all work for people. We don’t work for companies. We go change jobs to go work with someone, not necessarily to work for a specific company.”

In some situations, the company will rely on the senior data scientist for setting its data science and AI strategy. Ideally, however, the company will already have ideas where they want to apply data science and AI technologies and techniques, and the senior data scientist is brought in to execute those ideas with process and rigor.

“Ideally it comes from the top,” Dobrin says of the idea-creation process. “In an ideal situation, it’s the CEO. That’s a rare situation though. Usually it’s one or two people who get it. It’s the CIO or CFO or CMO who gets it, that starts pushing us and starts driving it within the company and getting the resources.”

If a company is struggling to come up with the big idea, there are plenty of consultancies that can help with that. Systems integrators like Deloitte, KPMG, EY, and PwC all have large staffs of data scientists and others who are experts at analyzing business models and figuring out where data can give them a boost.

Headhunters can also be hired to bring in an experienced data scientist to get things started. “It is a little bit of a chicken and egg problem if you don’t know how to build that value proposition,” Dobrin acknowledges.

Dobrin used familiar tools and channels to work his network, including phone calls, emails, and LinkedIn. Getting the right job description on job boards is critical to clearly communicating the role, and acting quickly on nibbles is also important to hooking the big fish.

“If you take six weeks to go through an interview process from first contact to offer, you’re going to lose people,” he says. “My goal is 10 days.”

Skills Matter

When it comes to specific technical skills, there are a few areas of expertise that are absolutely critical, such as Python. If you don’t know Python in this day and age, you had better have a hard-to-find talent in some other area. Apache Spark has become a critical tool in many data scientists’ toolboxes, so being familiar with how to use it is important. R is still a popular language for data science too.

On IBM’s Data Science Elite Team, XGBoost has become the go-to algorithm for traditional machine learning problems, thanks to its power, tunability, and resistance to overfitting, according to Dobrin. “There’s a constant barrage of new tools, methodologies, and packages that are out there that people just need to be up to date on,” he says.
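Under the hood, XGBoost is gradient boosting over decision trees: each new tree is fit to the residuals of the ensemble so far. The core idea can be sketched in a few lines of plain Python using depth-1 “stumps” (a toy illustration of the technique, not XGBoost’s actual implementation, which adds regularization, second-order gradients, and much more):

```python
# Toy gradient boosting for regression: each round fits a depth-1
# "stump" to the residuals of the current ensemble (squared-error loss).
def fit_stump(xs, residuals):
    # Try every midpoint split on sorted xs; keep the one with least squared error.
    best = None
    for i in range(len(xs) - 1):
        t = (xs[i] + xs[i + 1]) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, rounds=200, lr=0.3):
    # Classic boosting loop: fit residuals, add a shrunken stump, repeat.
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 4, 9, 16, 25]  # y = x^2
model = boost(xs, ys)
```

The shrinkage factor (`lr` above, `eta` in XGBoost) and the regularization terms XGBoost adds on top are what give it the tunability and overfitting resistance Dobrin mentions.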

Graduating from a data science bootcamp is a good start, but it’s not enough to consider yourself a full-fledged data scientist, says Pedro Alves Nogueira, who heads up the data science and AI business at Toptal.

“There are not a lot of people in the market with proven experience,” says Nogueira, who has a PhD in AI, human-computer interaction and affective computing from the University of Porto in Portugal. “Doing a bootcamp on AI and data science is probably not going to be enough for you to be a data scientist. It’s good enough for you to learn a skill…but it’s not going to give you the basic mathematical knowledge.”

Toptal prides itself on having the top 3% of talent – hence “Toptal” – in a given field of development. The company started by offering developers on demand for general application and Web development. As more clients looked to Toptal for data science and AI expertise, the company decided to formalize its data science and AI business by creating a dedicated department.

Education Rules

While data science is becoming more automated by the day, it’s still critical that a data scientist know how machine learning models work at a deep level, and be able to build them by hand, if necessary, Nogueira says.

“We allow developers to use whatever technology they want,” Nogueira says. “What we’re most interested in is having the fundamental ability to understand the models and to implement them from scratch.”
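As a concrete example of the kind of from-scratch exercise Nogueira describes (this particular one is hypothetical, not part of Toptal’s screening), here is ordinary least-squares linear regression fit by gradient descent in plain Python, with no ML library involved:

```python
# Simple linear regression (y = w*x + b) fit by batch gradient descent,
# minimizing mean squared error -- the kind of exercise that tests whether
# a candidate understands the model, not just a library's fit() call.
def fit_linear(xs, ys, lr=0.05, epochs=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = (2 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # generated from y = 2x + 1
w, b = fit_linear(xs, ys)
```

A candidate who can derive those two gradient expressions and explain why the learning rate must be bounded has the “basic mathematical knowledge” a bootcamp alone rarely provides.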


Becoming one of Toptal’s data scientists or AI experts is a rigorous process, Nogueira says. The first step is ensuring that the prospective data scientists are proficient in English, which is important considering that the company heavily recruits from eastern and southern European universities. Next, they must prove their mathematical chops by solving a series of ML and AI problems.

“Then somebody on the existing screening team, who is himself or herself a senior AI or data science developer and has been working with us for two years, [takes the prospect] through a live coding session,” he says. “Then you have to do a two-week sample project that you have to present to us as if we’re the client. We spend a lot of energy and time to make sure they really know what they’re talking about.”

Ultimately, pairing a Toptal data scientist with a specific client takes careful analysis of the business outcomes that are sought and the capability of the worker to fulfill the technical requirements.

“It’s not just about building models,” Nogueira says. “It’s about knowing what you’re building and making sure what you’re building is intelligible to people who are going to be using it, and that it is solid and useful for the business itself.”

Once you find a good data scientist, retaining them is also important. Providing good data science problems that impact the bottom line is arguably the best way to keep them around. Giving them the freedom to learn new technologies and techniques is also important. Of course, offering them a competitive salary and good benefits are critical too.

Related Items:

How To Build a Data Science Team Now

What Kind of Data Scientist Are You?

Taking the Data Scientist Out of Data Science

IT managers look to deploy BI, analytics in the cloud


IT managers are looking to position BI and analytics in the cloud.

That’s according to TechTarget’s 2019 IT Priorities Survey, which asked a total of 624 IT professionals from a wide range of industries in North America about what’s on their to-do lists.

The survey posed the following question to respondents: Which of the following applications are you most likely to deploy in the cloud this year? Among the 231 IT professionals who responded to this question, the top response was BI/analytics (27%), followed by customer relationship management (23%) and big data platform/enterprise data warehouse/data lake (21%). In last year’s survey, respondents said they were most likely to deploy CRM (34%), ERP (29%) and business process management (27%) in the cloud.

Respondents who plan on deploying data warehouses, data lakes, BI and analytics in the cloud this year are in alignment with a growing enterprise trend, according to experts.

Jen Underwood, senior director at machine learning software vendor DataRobot, said the results are “not at all surprising.”

“Analytics in the cloud is usually ahead of on-premises offerings,” Underwood said. “With rapid weekly updates, on-demand scale, speed and ease of simply getting things done, cloud is a no-brainer for many organizations outside heavily regulated industries.”

Application deployments planned for 2019

Isaac Sacolick, president of StarCIO and author of Driving Digital, said, to drive efficiencies, improve customer experience and target optimal markets, organizations are going “from centralized BI functions to more distributed analytics teams supported by citizen data scientists using self-service BI tools.” Deploying BI and analytics in the cloud is a sensible next step in that transition.


“Cloud offerings enable organizations to quickly and more easily ramp up BI tool usage, provide access to more data sets and scale usage of produced analytics with less effort by IT to enable and support infrastructure,” Sacolick said. “IT is then better poised to partner with the business on data governance, integration and modeling initiatives that fuel ongoing analytics needs.”

Underwood and Sacolick aren’t alone in their thinking. Feyzi Bagirov, a data science advisor at a B2B data insight vendor, said he also is seeing more organizations deploying BI and analytics in the cloud, but that the trend is still in the early stages. He cited 2018 Gartner research that found on-premises deployments still dominate globally, ranging from 43% to 51% of deployments.

Data governance, predictive analytics are priorities

The 2019 IT Priorities Survey also asked respondents what information management initiatives their companies will deploy in 2019. Among the 215 IT professionals who responded to this question, the top response was data governance (28.8%), followed closely by predictive analytics (27.9%) and data integration (27.9%). Big data platform/enterprise data warehouse/data lake (26.5%) rounded out the top four.

Bagirov said he thinks these results more or less align with enterprise trends. He said that priorities may vary by industry — companies in the financial sector might be more inclined to push data lake initiatives, for example.

Data governance and integration will top IT professionals’ objectives this year, Bagirov said. “Those are the steps that are essential before predictive analytics can be scaled up,” he said.

Management initiatives planned for 2019

As for Underwood, she said the European Union’s recent rollout of GDPR likely influenced data governance’s top placement in the survey. Governance probably won’t be as prominent next year, though, she said.

“In my machine learning and artificial intelligence work … I am seeing early adopters achieve astounding results that I have never seen happen throughout my entire 20-plus-year analytics career,” Underwood said. “The artificial intelligence gap is already being exploited as a game-changing competency for competitive advantage in the algorithm economy. As a result, I forecast predictive analytics to be No. 1 on your ranking next year. Artificial intelligence and automation is changing analytics as we know it today.”


Advantages of graph databases: Easier data modeling, analytics

The rising tide of big data is one of the factors prompting more users to consider the advantages of graph databases and graph data modeling.

Powering the grid with graphs

Guangyi Liu has begun to work with TigerGraph’s massively parallel processing graph database as part of an effort to build a system that matches electricity supply-and-demand on the fly.

Bringing real-time analytics performance to electrical power distribution has been a holy grail in the utility industry, said Liu, CTO at the Global Energy Interconnection Research Institute North America. GEIRINA is an R&D center in San Jose, Calif., that’s affiliated with the State Grid of China, a government-owned utility based in Beijing.

Liu’s team is looking to do large-scale linear equation processing on a topology representing signals from the millions of sensors, actuators, relays and switches on a power grid. The project, which began in 2015, originally tested out Oracle’s relational database software. But like Jarasch, Liu found drawbacks to the relational approach.

Guangyi Liu, CTO, GEIRINA

“With the Oracle database, you need to convert tables into a data structure representing the topology of the system,” Liu said. With TigerGraph, however, “the topology is right there,” he added. The graph database also makes it possible to do data searches and calculations in parallel, according to Liu.
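Liu’s point can be illustrated in plain Python: in a graph representation, the grid topology is directly traversable as an adjacency list, while a relational store would have to reconstruct it through joins. (This is a toy sketch of the idea, not TigerGraph’s storage model, and the node names are made up.)

```python
from collections import deque

# Toy grid topology as an adjacency list: nodes are substations, feeders,
# switches, and sensors; edges are the lines connecting them. The topology
# "is right there" -- no join is needed to walk to a node's neighbors.
grid = {
    "substation": ["feeder1", "feeder2"],
    "feeder1": ["substation", "switch_a"],
    "feeder2": ["substation", "switch_b"],
    "switch_a": ["feeder1", "sensor1"],
    "switch_b": ["feeder2"],
    "sensor1": ["switch_a"],
}

def reachable(graph, start):
    # Breadth-first search: the kind of connectivity query that maps
    # naturally onto a graph database and awkwardly onto joined tables.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen
```

In a relational schema, the same query would mean repeated self-joins on an edges table (one join per hop), which is the conversion overhead Liu describes.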

Philip Howard, an analyst at London-based Bloor Research, said he expects the use of graph technology to continue to expand. In particular, he pointed to the advantages of graph databases over relational software for the large-scale “who knows who?” questions that underlie many modern applications.

Yet graph tools are currently often used as an adjunct to relational databases or other types of NoSQL systems, if at all, Howard said. Graphs may offer a more natural way to model and connect data, he noted, but IT teams tend to think “inside the box” when evaluating and selecting data management platforms.


On Amazon, You Can Get Books, Baby Wipes, and Blockchain


Amazon Web Services announced today that its marketplace will offer private blockchain technology created by Kadena, a company spun out of JP Morgan’s blockchain development efforts. Kadena’s blockchain service boasts security and scalability, claiming to have supported up to 8,000 transactions per second across 500 nodes. For comparison, Ethereum can handle about 15 transactions per second, and bitcoin can take on about seven.

The technology is available essentially for free on the AWS marketplace under the name “Kadena Blockchain for Enterprise, ScalableBFT: Community Edition” (though there’s an “estimated infrastructure cost” of $0.10 per hour for using Amazon’s Elastic Compute Cloud). It supports up to four nodes per user and 2,000 transactions per second, offering features like contract governance, trust-free escrows, and bug detection. The paid version (which you have to contact Kadena directly to use) will support more nodes and deliver on that maximum 8,000 transactions per second promise.

The new Kadena offering won’t be AWS’s first push into blockchain. In November at Amazon’s AWS re:Invent conference, the company announced Amazon’s Quantum Ledger Database (a blockchain-esque centralized ledger) and Amazon Managed Blockchain, an Amazon-owned interface layer that businesses can use to navigate their own blockchain networks. Kadena will not integrate with either of these products, Kadena CEO Will Martino told BREAKER. He’s more concerned with his company’s technology integrating with cloud application stacks such as Oracle, so as to better serve enterprises looking to integrate with “normal [aka, non-blockchain] databases.”

So why did the company decide to offer its product on the AWS marketplace? AWS had more than one million active enterprise customers at last public count, which was in 2017, and Amazon as a whole has millions more. Unsurprisingly, Kadena put its services on Amazon for the sake of adoption, Martino said. When asked who would be adopting Kadena now that it’s on AWS, however, Martino was a bit vague, citing the industries of some of their current clients (he believes insurance and healthcare will be the earliest users) and the general need for scalable, private blockchain solutions for enterprises.

At JP Morgan, the technology that preceded Kadena, called Juno, ran internal international payments between London and Tokyo through the U.S. “Banking was the first group to really try out blockchain technology and experiment with it,” said Martino, but he believes bankers will be “the last to adopt it.” Putting Kadena’s services on AWS gives other industries the chance to use this technology that might otherwise languish at banks.

Don’t let the admittedly boring nature of this announcement fool you. Amazon continues to shill more and more products every day, increasing its breadth to the point where it can name customers belonging to any demographic, in any vertical. If you can now buy your blockchains on Amazon, too, will people even remember the technology’s freeing promise of decentralization, especially when the best-known public blockchains have a lot less to offer in terms of scalability? Or will they forget that they could use open-source technology to create their blockchains, just like they’ve forgotten they can venture out to a physical store to buy toothbrushes?


Snowflake, DataRobot Partner To Bring AI to Data Warehouse-as-a-Service


Snowflake Computing is partnering with DataRobot and its advanced enterprise automated machine learning platform. The goal is to bring AI to huge caches of enterprise data held in Snowflake cloud-based data warehouse installations.

by Craig Gehrig

Tags: AI/ML, analytics, cloud, DataRobot, data warehouse, DWaaS, Snowflake

Snowflake Computing is partnering with DataRobot and its enterprise automated machine learning platform to bring AI to huge caches of data held in Snowflake cloud-based data warehouse installations.

The deep product-level integration brings together Snowflake’s scalable and manageable cloud-based data warehouse as a service functions with DataRobot’s AI and analytics capabilities.

As a result, Snowflake’s data warehouse services can amass a vast amount of an organization’s data and then make it simple for DataRobot’s AI/ML capabilities to access it – all without disrupting the warehouse’s performance or operations. Enterprise users are then free to obtain deeper data-driven insights, according to Snowflake vice president of alliances Walter Aldana.

“We’re committed to providing our customers with the tools they need to empower data-driven organizations — an imperative for every organization today. The partnership with DataRobot extends our ability to provide customers with access to automated machine learning technology that will uncover business-driving predictive insights,” Aldana added in a statement.

Seann Gardiner, DataRobot’s senior vice president of business development added, “Businesses today know they need to embrace AI to stay ahead. Our partnership with Snowflake makes it easier than ever for users of all skill levels to access data and build, tweak, and deploy machine learning models to make business decisions.”

Snowflake’s cloud-based data-warehouse-as-a-service (DWaaS) architecture offers a single, logically integrated solution to supply full relational database support for both structured data (e.g. CSV files and tables) and semi-structured data (e.g. JSON, Avro, Parquet, etc.).
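What “full relational database support for both” means in practice is that tabular and semi-structured records can be queried through one interface, with missing fields surfacing as NULLs rather than breaking the query. A rough plain-Python analogy (an illustration of the concept only, not Snowflake’s actual VARIANT machinery):

```python
import csv, io, json

# Structured data: a CSV table with a fixed schema.
csv_text = "id,name\n1,alpha\n2,beta\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured data: JSON records whose fields can vary per record.
json_text = '[{"id": 3, "name": "gamma", "tags": ["x"]}, {"id": 4}]'
records = json.loads(json_text)

def project(record, columns):
    # Uniform relational view: absent fields come back as None (NULL).
    return {c: record.get(c) for c in columns}

# Both sources now answer the same "SELECT id, name" shape of query.
unified = [project(r, ["id", "name"]) for r in rows + records]
```

Snowflake does this natively at the storage layer, so analysts can write one SQL query over CSV-loaded tables and JSON documents alike.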

Snowflake also offers discrete metadata processing. Notably, Snowflake’s architecture separates compute and storage services, allowing them to run and scale independently of one another. This means Snowflake can scale near-linearly as compute needs grow, enabling data warehouse managers to support enterprise-wide data warehouse requirements with virtually unlimited concurrency.

For developers, Snowflake also offers a powerful query processing back-end platform to help them create modern data-driven applications.

Data is also kept secure. The Snowflake services layer is built for security: it authenticates user sessions, enforces security functions, and performs query compilation, using metadata to determine which micro-partition columns need to be scanned. It also handles all security and encryption key management.

Snowflake is also compatible with popular ETL and BI tools.

All these features set the stage for DataRobot’s enterprise automated machine learning platform to securely access vast amounts of Snowflake-resident data at high performance – and deliver insights in compelling ways. Specifically, the DataRobot platform delivers high levels of automation and simplicity for machine learning initiatives.

It allows companies to create and deploy powerful machine learning models without the time and expense of a traditional data science process. Each model is unique to your dataset and prediction target; the platform automatically searches through millions of combinations of algorithms to determine the best learning model for your data.

Internal to the platform, DataRobot sports a massively parallel modeling engine that can scale to hundreds or even thousands of powerful servers to explore, build and tune machine learning models.

Developers can use DataRobot to easily build and analyze machine learning models, and embed them using a variety of options. Just a few lines of code invoke DataRobot to test hundreds of solutions that combine data preparation with a range of open source algorithms from R, Python, Spark, H2O, and many more.

DataRobot’s APIs for Python and R provide complete transparency into the model-building process, allowing developers to iterate on their models and even compare solutions for speed and accuracy. Another benefit of DataRobot is that any model is production-ready. With the APIs, users can operationalize machine learning models for real-time predictions, batch deployments, or scoring on Hadoop.

Beyond its AI/ML capabilities, DataRobot also sports other important features for data-driven insights. Among them:

  • High-availability
  • Distributed and self-healing architecture
  • Integrates with existing enterprise security
  • Hadoop cluster compatibility
  • Multiple database certifications

Elsewhere, Snowflake also is partnering with global solutions provider Saggezza to reduce costs for data-related projects and expand the ways in which data can be interpreted. With Snowflake, Saggezza can aggregate all sources of client data into a single source for easy viewability, providing a 360-degree view of data and making it easier to attain actionable insights from the data, according to execs from both firms.

