Data Modeling: How to Get the Most From Your Data w/ @CSDoctorSister (Episode 29) #DataTalk


– Hello and welcome to our
weekly Data Talk, a show where we feature data science
leaders from around the world. Today we are honored to
have Dr. Brandeis Marshall. She is the chair of computer and information sciences
at Spelman College. Before that, she served
as a professor of computer and information technology
at Purdue University. We have these chats every single week. If you’re watching live,
we welcome your comments, welcome your questions. Today’s topic is all around
data modeling techniques and strategies for getting
the most out of your data. Dr. Marshall, it’s an honor
to have you in our chat today. – Oh, thank you for having me Mike, and please call me Brandeis. – Okay.
(laughs) – All right Brandeis, and
for those that are listening to the podcast, if you want
to read a transcription or watch the video or
get any resources that Brandeis mentions, the short
URL is just ex.pn/marshall. That’s the URL you can go to. Brandeis, I always love
to kick these things off with kind of tell us your journey. What led you to begin
studying data science and then eventually teaching it? – Well, it started in
actually grad school, and I took a database
class and I loved the order of the data and being able
to construct some sense out of what is happening
with all this information. Of course, the information that I was looking at was just
text, but then I started getting interested in multimedia. I was starting to get interested in what’s happening with audio. I love music, so what’s happening with now this blend of audio
visual type of data. That’s where it really
started on my journey. After I finished graduate
school, I of course was teaching and I’ve been teaching databases
for almost 10 years now to undergraduates and graduate students. Yeah, I just have a passion for data. I’m a data nerd, data geek. (laughing) I like calling myself a data geek, but I just enjoy trying to figure out and problem solve how
to make sense of data. How do we do it? What are the constructs
that prohibit us to do it? How do you do it with messy data? How do you do it with missing information? All that stuff energizes me
and inspires me to learn more, ’cause it has not only
impact for an organization, but it also has impact
societally and socially, right? I mean, there’s a lot of new data sets that are coming out now
that it’s now being noted. There’s lack of representation
for people of color, for women, for people with disabilities. These issues need to be on
the forefront of our minds when we’re now constructing
and now looking and deep diving into this data. – Brandeis, you touched on
some really important issues. Before we dig into data modeling which all goes back to
what data do you have. You were just sharing, right? You’re just sharing, it starts with the data before we can model it. You don’t have accurate data
or well represented data, it’s going to mess up your
entire decisioning, right? – Right.
– Can you speak to some of these challenges, these problems that you’re seeing in society right now? – Well, I touched on them a little bit. Right now I see there’s
a lack of representation for women, for people of color, and then different socioeconomic
type of challenges. I mean okay, so let me
back up a little bit. Just when you get a data set, it typically is a CSV file, a zip file. It might contain other
data sets and sheets within them that are in Excel or CSV. You don’t even know
what you’re looking at. You don’t know if the
column names are correct. Do they represent what the information is supposed to be inside of there? Are you able to take that information and then put it into a system, right? That’s assuming that you
have the right format. So there’s all these just initial issues when you receive some set of data. Modeling it then becomes
an extra challenge ’cause then you have
to now understand what you’re looking at and then
being able to interpret. You’re problem solving before
you even get to the problem. That’s the first issue. Then once you’ve made certain assumptions and interpretations, then
you move on from there. That is where you get into
this tricky situation of… how do I now make decisions… based upon my assumptions, assuming my assumptions are correct. How do I validate my own assumptions? That’s me as an individual researcher. That’s me in a group of researchers. That’s me with my students. How do I know what I’m
looking at actually is correct and how do I then move forward? It’s so rich of questions and challenges. – And that gets into the
ethics right there, right? ‘Cause what biases we bring
to the data set will possibly just ruin whatever it is that the data might be speaking one
thing, but because you come to the table with a certain
bias, a certain assumption, it’s now going to twist
it to tell your own story. – Exactly, ’cause you can make data say anything, even if
it’s modeled correctly. Let’s say that it is accurate. You have the right column
names, the contents of all of your data is correct, but
you can manipulate that data. You can make a positive seem negative. You can have a negative seem to be a positive with enough finesse. There’s definitely this
notion of how do you critically think through the data and then what is your
assumptions about what that information coming out
is really going to tell you. What story are you trying to arrive at and are you manipulating the data so that you arrive at
the story you anticipate, or are you letting the
data tell you the truth? That’s really the ethics part of it. You would think data
modeling is very easy. It’s very structured. Okay, we have relational databases. We have object-oriented
and we have time series type of databases, but at the core of it, it’s someone, a designer, that is trying to now put some parameters
or some guardrails around how this data is
being placed into a system, whether that system is PHP or JSON, or something as large as Bigtable, or anything that’s like
Apache Spark, right? No matter what, there’s always this human intervention when
it comes to data modeling. There’s this human
intervention when it comes to interpreting the data and
then what are those outcomes and results that you hope to achieve, and then what is the surprising outcome and results as well, and
what do you do with that? – So Brandeis, as you’ve been
teaching for over 10 years, training upcoming statisticians
and data scientists. How do you help them with this challenge of coming to a data set objectively and trying to put away assumptions? How do you kind of train your students or teach them this process? – Well, the first part is
always collecting all the data. There’s two different veins. You can be collecting data or you can be using secondary
or tertiary type data. I try to have students do
their own data collection so they understand the challenges of what decisions they need to be making and what assumptions that they’re having. So that’s the first part. They collect their own
data before I move them into let’s use someone else’s
data and see what happens. The other thing that I do is give them a lot of case studies. There’s a lot of books that are available. I happen to use Hoffer’s textbook which is Modern Database Management. It’s very well-known within
the database community, but they have a lot of large problems. They call it field exercises. Those tend to work very
well, so I kind of try to piece and parse those out as well as talking with the students
about these problems. I will provide an example and
then start working through it, because one of the challenges
in training up individuals in this space is you have
to know what the problem is. That means you really
have to read critically and you have to make sure that you have some type of advanced
reading comprehension. What are you reading, how are you interpreting what’s being read, and then how are you going to represent that? When they have this data set, they now are in the knowledge that
they don’t know everything and that’s difficult for a lot of novice people to realize and understand. But at the end of the day,
they then appreciate it. I always have students
in my database class do some sort of semester long project where they’re either
constructing their own database, a system using some company
that already exists, or they’re modifying
someone else’s version of a company’s database structure. That way, they start to see are these business rules
being implemented correctly? What type of business rules
would make more sense? What business rules can be
implemented in the database? What business rules need
to be done through queries? They really start to piece
out how they need to think through the data model and
what the limitations of it is. Hopefully when they get
to their industry job or they move on to graduate studies, they then can see data in… see it for what it is. It is a hot mess most of the time. (laughing) But at the very least they
have the resources in order to tackle the hot mess
and put some constructs around it to make it manageable for them. That’s what I think every company’s trying to do within data science. It’s trying to manage the crazy, trying to manage the chaos of their data. They don’t know what it is,
where they’ve gotten it. How does it all fit together? But that is what makes
data science beautiful, ’cause everyone can come
in and have an opinion and have justifications
that they can work toward those interpretations to
make sense for the company. – I love it, it is a hot
mess ’cause you think about all the data and every
year, companies are getting more and more data, whether
they’re purchasing it or just collecting it naturally. I heard this example
from another broadcast where someone was saying it’s kind of like those Where’s Waldo books.
(laughing) Where you open up the pages
and there’s all these people and you’re like, “Where is Waldo?” Where’s the data you actually need that’s going to actually be
important to the business? So yeah, it is a hot mess. – It is, it really is. There’s no other term. I know it’s so technical. (laughing) But it really is. Okay, there is a technical term for it called messy data, but–
– I like hot mess better. – But hot mess is really what I see. (laughing) You know, with my students, we’ve been working with Black Twitter. As you know, there has
been a lot of conversation about the Oscars and the
representation of people of color being nominated and
then being awarded Oscars. We started this work in 2016, when there was no representation
of black people to 2017, when there was a lot of representation, to 2018 where it was very much in the spirit of the Me Too movement. So that intersectionality between gender and race was a conversation. My students have been gathering Twitter data during the Oscars. – Oh, that’s awesome. – That has just been
an experience for them and for me that has been wonderful because they now have a reason and a purpose to collect information. They have a reason and a purpose to try to figure out what is the trends? What is the words being spoken? Who is coming up in the Twitter feeds? What countries, what cities? Who are the influencers in this area? Having them really take a forefront and be the leaders in spearheading and pioneering that
type of work is awesome, especially because they’re undergraduates. That doesn’t happen very often, but they’re on the forefront of that. I’m just glad that it is something that I thought of the idea,
but they have been able to just roll with it
and take all of the… I don’t know what this data’s telling me. I don’t know what this
graph is telling me. Then we talk through it, right? We don’t need all these different columns. We only need these. We’re parsing through it. Oh, there’s so much data. How do we harness this data? How do we put it into Python… and how do we use pandas effectively? These are great questions and things that everyone in the data
science world needs to know. – What a wonderful way to
get your students involved and get them passionate
and get them curious about different ways of
working with the data, especially with such a great political… The political climate that we’re in, the racism that’s going on. Then having your students get involved with actually looking at the Twitter data, looking for racism, looking for maybe even predictions about who might
win the Oscar possibly. That’s awesome that you’re
getting them involved. Also I think it’s amazing
that you get your students with this huge project to
actually build their own data set. I mean, what a massive
project that must be and what a great learning
experience for them. – Yeah, I really am enjoying it. I mean, they are really scraping the Twitter data themselves. They’re creating their own
data sets and then having to parse through those and
try to understand those. That’s what it takes in data science. It’s practice, right? So your original question
was what do you do? It’s a lot of practice at the core of it. You have to get your hands
dirty, roll up your sleeves, and you have to fail. You have to get to a
place where you’re like, I don’t know what I’m doing but
let me try to figure it out. Then you figure one
thing out and you might take two steps back, but
that is what it takes within data science
because the data itself doesn’t provide any valid,
quality information. That’s up to the humans to do. You have to continue to press
forward in that persistence. There’s a lot of language around grit. I think that’s what it
takes within data science is this grit to continue to persevere, to have a passion for trying to understand and provide quality information
about what is being housed in these black boxes that
we are calling data sets. – Yeah, that’s awesome. I love that you use the
word grit because yeah, when you have a huge data
set and you don’t know what you’re looking at,
you’ve got to be gritty. You’ve got to be curious
to start digging through to figure out what should the labels be? Are the labels accurate? What am I looking for? Then also if you’re
working for a business, what is the business goal? What is the challenge, the pain point that you’re trying to solve? – And sometimes you don’t
know that in a business. (laughing) In a business, you just know that your ROI is too high or too low. One department has too many applicants. Another department doesn’t
have enough applicants. How do we balance this out? Sometimes you don’t know what
it is, but that discovery and that curiosity that you’re mentioning and that you’re speaking to is very much on point, very, very much on point. – So as you’re teaching these classes and helping these young
professionals work with data, how do you guide them for which
models they should be using? And maybe even before we get to that, what are some of the
common database models that you kind of
encourage them to look at? – So I encourage them always to start with relational databases and that is the most structured data
model that’s out there. The reason why I focus
on that fundamental, not just because that’s where
I entered into the data world but also because a lot of organizations still use relational databases. You want to be relevant
and you want to make sure that you’re able to enter a company and be able to understand their model. Relational databases that can
include a MySQL, a Postgres, or an Oracle if you had a
very large organization. So that’s where we start. Then there’s other models
that then they can easily translate their knowledge of
relational databases to others. That could be NoSQL, Cassandra. There could be more
distributed type of databases. Then you can move into big data databases like that’s housed within Spark and Hadoop and things like that. You can look at how are you going to understand what’s important in your data? If you care about the
timing, then you might want to do maybe a NoSQL Cassandra. It has its pros and cons,
but if you want to look at more structured, then you
might want to do a relational. If you care more about web interfacing, then you definitely want
to do a JSON, right? There’s Node.js and other
versions that are aligned with Java type libraries,
but you have to have an idea of how you’re going to
use the data in order to then decide what type of
database you want to house it in. It might be a combination, right? You might be storing information inside a relational database as kind of like an archival type system, but then when you do the processing,
you’re using JSON. That happens all the time. (laughing) ‘Cause you have to translate the data. That’s all that you’re doing. 80% of the work is really
in cleaning the data, so that does sometimes mean it’s moving it from one system to
another system construct when it comes to the data model. That’s where I start the students and that’s the conversation
that I have with them. – And that’s the hashtag hot
mess, cleaning up that data. (laughing)
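The pattern described above, keeping data in a relational database as the archive and translating it to JSON for processing, can be sketched roughly as follows. This is only an illustration using Python's built-in sqlite3 and json modules; the authors and tweets tables, their columns, and the export_tweets_as_json helper are hypothetical names made up for the example, not anything from the interview.

```python
import json
import sqlite3


def export_tweets_as_json(conn):
    """Nest each author's tweets under the author record (hypothetical schema)."""
    rows = conn.execute(
        "SELECT a.id, a.handle, t.body "
        "FROM authors AS a LEFT JOIN tweets AS t ON t.author_id = a.id "
        "ORDER BY a.id, t.id"
    ).fetchall()

    authors = {}
    for author_id, handle, body in rows:
        # One relational row per tweet becomes one JSON object per author.
        record = authors.setdefault(author_id, {"handle": handle, "tweets": []})
        if body is not None:  # authors with no tweets come back with a NULL body
            record["tweets"].append(body)
    return json.dumps(list(authors.values()), indent=2)


# Hypothetical archive: two flat tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE authors (id INTEGER PRIMARY KEY, handle TEXT NOT NULL);
CREATE TABLE tweets (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    body TEXT NOT NULL
);
INSERT INTO authors VALUES (1, 'ada'), (2, 'grace');
INSERT INTO tweets VALUES (1, 1, 'first'), (2, 1, 'second');
""")

print(export_tweets_as_json(conn))
```

Even in this small case the translation involves decisions: two flat tables collapse into nested objects, and an author with no tweets has to be handled explicitly. That is part of why moving data between models counts as cleaning work.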
– Right. ‘Cause it’s interesting
in how students will, and even myself, I get into this problem. I just kind of assume oh, once I have it inside of a database, whatever that is, I can easily translate it into JSON. Why do I make that assumption every time? I’ve been in this field too long. Why do I make that assumption? Because not everything
is a one to one matching. You always are learning something new. There’s always a new update. There’s always a new
method that’s being shared online via Twitter, via Facebook, and just different data communities. You’re like, oh, that’s deprecated. All right, I can’t use that method. (laughing)
Let me use something else. Oh, that library is no longer the trending hot thing to use anymore. It has these X, Y, Z limitations. I need to move over to
another suite of libraries. So that is always that learning
objective that you need to have that curiosity and
that grit I spoke to earlier. – I love that. Now what’s your recommendations
for professionals that are looking at a data set
that’s all unstructured data, maybe even third-party data
that has video, pictures. Where would you begin
with choosing a model? – Wow, so the first
thing I probably would do and I have done this is I
ignore all of the emojis and all of the video
and audio information. I do that–
– No hands? No hand emojis?
– No. (laughing) No hands, no thumbs up. We’re getting rid of all of those because those are all being translated into text as computer language.
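A minimal sketch of that "drop the emojis" step, assuming the text is already loaded as Python strings. The strip_emojis helper and the code point ranges are assumptions chosen for illustration: they cover the common emoji blocks but are not an exhaustive definition of what counts as an emoji, and everything else in the text, links included, is left untouched.

```python
import re

# Assumed, non-exhaustive set of code point ranges that hold most emoji.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, and neighbouring symbol blocks
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters (used for flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def strip_emojis(text):
    """Remove emoji characters, then collapse the whitespace they leave behind."""
    without_emojis = EMOJI_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_emojis).strip()


# Thumbs up with a skin tone modifier, then a party popper, written as escapes.
sample = "Congrats \U0001F44D\U0001F3FD https://example.com/a \U0001F389"
print(strip_emojis(sample))
```

A dedicated emoji library would track new Unicode releases more reliably than hand-written ranges; the regular expression here is only meant to show the shape of the cleanup.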
– I’m crying now. (laughing) – I remove all those, only because… it does tend to cloud the judgment. What I do want to keep in
the data set is the URLs. That is very important. The emojis not so much right now, even though more and more
people are just speaking in emojis, which is hurting my heart a little bit because I do like words. I am an educator, so it does
vex my spirit a little bit that we’re only speaking in emojis, but– – 100.
(laughing) – But I do want to use… natural language processing techniques. There’s so many that
are readily available. I think that’s the first
step is just trying to remove all of the impediments or blockades to you actually getting
to valuable content. Text first and then you
can build upon there. Where do you start outside of that? I am a Python person. There’s basically two
camps in data science. You’re either Python or you’re R. That’s about 90% of it. I’m a Python person, so I
would say Python, pandas. You’re going to use
NumPy, matplotlib as far as those packages in order
to create reusable code. Now let’s say for instance
that you’re not a coder. You haven’t looked at code. You don’t want to deal
with being a designer. You want to really be using tools. I would say bring it into an established tool like Tableau… would be your entry point. But if you’re curious
about where you might land on that spectrum, there’s
different resources you can try. I have used Berkeley has a course called Foundations of Data Science. It is available. Most of the materials are available that you can start working
through some of these techniques and understanding linear regression. You get a little bit
of Python introduction. There’s other tools and other companies, non-profits such as Data Carpentry. They provide lessons that are online that you can download
and they help set you up. They move forward. They even provide instructors if you want to create workshops,
one, two day workshops. They provide that. I’ve also used a relatively new… organization called DataCamp. That’s something that
a lot of professionals, a lot of companies like to use in order to help scaffold their
employees who need to work with data and really kind
of hone their data skills. There’s a lot out there
outside of just Coursera (laughing) and free courses, but
it’s a landscape that is having more and more
companies and organizations trying to figure out how to
teach and how to make sure that there is retention
in these data skills. Those are just a few.
– Wonderful. We just got a question
on Facebook from Joan who says, “What do you think about knime?” – KNIME.
– Have you heard of that? – Yes, yes. That is a wonderful
platform that is really meant for I think someone that has some understanding of
computer programming, algorithmic design, more specifically that is quite familiar
with how to deal with code. So yes, I think that’s a
wonderful platform, yes. – Well I can’t believe
we’re almost out of time. I do have a couple last questions for you. – Oh no, it’s only half an hour. – I know.
(laughing) Do you have any final tips
for those who are just starting out with data modeling, just some tips, suggestions to help them? – I think it’s really about your mentality when you are approaching data modeling. It’s a lot of don’t be frustrated. Everyone goes through this. Resources are plentiful. There’s a lot of database books in order to help you get some understanding, and I can share those with you afterward. I mentioned the Hoffer textbook. That’s just one, but there
are several other ones that are pretty much staples
in the database world and hence the data science world. One day at a time, just try
to learn one thing at a time. And practice, last thing
is going to be practice, practice, practice,
practice, practice, practice. (laughing) Collaboration, collaboration is necessary so if you do have accountability
buddies, more than one, I think that is awesome because
you can now talk through some of the challenges
’cause that’s what you do when you are constructing a data model. You have to talk to the
client multiple times, and that client sometimes is you. Sometimes it is someone
in the organization. But you need to collaborate. That’s a big misnomer
about computing in general, but also data science is that you have to talk to people multiple
times to really fine tune and hone what they’re looking
for and why they’re looking at this data a certain way
and that interpretation to really help move that
conversation forward. – Wonderful. What is your advice
for those lot of people in our community who
are in graduate school? They want to become a
data scientist or some that are just starting college like, “I think data science is the path for me.” What is your advice for them on… just the whole process of getting
started with their career? – I think there is a number
of things that you can do. It’s never hopeless, even
if you’re at an organization or a school that does not
have a data science program. There are courses that
you can take and learning outside the classroom that
you can participate in. Let me give a couple of examples. If you’re at a university
institution of higher ed, statistics courses, computer
science programming courses are always helpful in this conversation. That’s the fundamentals,
’cause data science really is a blend of mathematics,
statistics, computer science. Then there is the discipline
specific science or field… that you are completely
in if you have at least the mathematics and the statistics and the computer science
courses under your belt. That’s introduction to programming. That is statistics. That is pre-calc, calculus. Those are all helping to build
your logic reasoning gates in order to now be able to enter into this data world
informed and able to make the proper interpretations
or at least be able to ask the right questions in order
to get you to the end result. That’s the main advice that I
have for any college students or graduate students that
want to enter this space. Then it’s a matter of
then doing some research about what type of programs exist. There’s a lot of programs
that are coming out now that’s blending data science
with every discipline, with economics, business
analytics it tends to be called, or with journalism. That’s when you have digital
media and digital humanities. There’s blending data science with English ’cause of all of this
transcription, right? You mentioned you’re going
to be transcribing this. – That’s right, that’s right. – There you go.
(laughing) There’s a whole field and
how do you now parse through this set of data which
happens to be all these words that are being said
over this past half hour and how do you work with
that set of information? Then it’s a matter of
then trying to understand where your passions might
lie as you work within this data science field
which is incredibly broad. Be patient, be persistent,
and you’ll be just fine. (laughing) – I am loving this interview. I am so sad it has to end. Okay, just one last question, then I promise I’ll let you go. – This happens all the time. All students say, “I have
one question for you.” (laughing) More than one, it’s okay. It’s completely fine.
– I’ll have you back– – I don’t need to eat lunch at all. It’s fine, no worries. (laughing) – Advice for the data science
leaders that are listening in right now who are building
up their data science teams. They’re looking into
bringing the best candidates. What is your advice for kind
of the interviewing process, questions to ask, what to
be looking for to bring in a good variety of people
who can help work on data? – You know, I think this really goes back to what most companies
want, which is people who can communicate effectively. That’s the number one… talent that everyone is trying to acquire. It’s not just about what
knowledge they know. It’s about whether or not
they’re able to communicate. Are they able to collaborate? When it comes to interviewing, it’s all about the conversation. It’s all about what type
of problems are you giving that allows the candidate in order to expand and express what they know. Is what they’re expressing
something that you as a company interviewer can work with, and then how does that interview let you know of their potential? Because in data science,
there’s no one that’s going to come in ready to do
exactly what you want. Let’s push that to the side. Let’s really talk about how
can you grow the data skills within each of your employees
and if those individuals that you’re hiring are
amenable to that growth in alignment with the
company’s goals and objectives. – I love that. I’m going to leave that right there. Brandeis, I got to have
you back at some point because at the beginning
of this conversation you were talking about music and I go oh, I want to have a data
science chat about music. – Yes!
– I got to have you back. I got to have you back for sure. – Oh, most definitely.
– For those who joined late, this is Dr. Brandeis Marshall. She is the chair of computer and information sciences
at Spelman College. If you want to get links
to her LinkedIn profile so you can follow her, to her website, you can go to the Experian blog where there’ll be a full transcription at the end of this week
of today’s episode. The short URL is just
ex.pn/marshall, M-A-R-S-H-A-L-L. If you are new to the Data Talk show, we have a whole list, our whole archive of all past shows over at ex.pn/datatalk. We have I think about 27 live interviews now that are recorded there. So that’s where you can go to get that. Dr. Marshall, thank you again so much for your time and just
sharing your insights. I love talking with you and thank you for sharing your insights
with our community. – Oh, thank you. This has been fantastic and I look forward to doing this again.
(laughing) – Awesome, can’t wait.
– All right. – Okay everybody, thanks for joining us. We’ll see you all next week. – All right, bye-bye.