The PR Post

Building a Boolean search for big data projects

June 27, 2018 by Tiffany Gallicano

If you are ready to pull data and want advice on how to build a strong Boolean search string, this blog post is for you. I don’t post to this blog often, but I have received several requests about this topic, so I decided to write about it. I’ll use the Charlottesville protest as an example because my work is in social media and activism.

Discover your keyword pool
Run an Advanced Search on Twitter, limited to your event dates, and try different keywords to see which tweets they return. Do the same for hashtags over the same dates. Next, sort each keyword into one of two paragraphs that you create in a Word document: an OR paragraph or an AND paragraph.

Build a list of OR words
First, identify any anchor word that is undoubtedly about the event you want to capture. For example, HeatherHeyer would be an obvious anchor word choice for the Charlottesville data because you can reasonably expect that anyone talking about her on Twitter is going to be talking about her in the context of the Charlottesville protest. Continue making a list of any word that uniquely connects with your event. Those words will be words that you connect with OR in your Boolean search string.

Build a list of the AND words
Next, make a list of AND words by identifying combinations of words that capture your event only when they appear together. In the context of Charlottesville, you would write something like “KKK AND cville” if you think that neither word would capture your event on its own. Put all of your word pairs together and connect the pairs with OR.

After you have finished your OR and AND lists, connect the two lists in a giant search string with OR as a connection word.
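The assembly step above can be sketched in a few lines of Python. This is only an illustration, not the process I used: the function names are hypothetical, the term lists are a small sample drawn from the examples in this post, and the exact operator syntax (AND, OR, parentheses around phrases) depends on the tool you pull data from.

```python
def wrap_phrase(term):
    """Put multi-word phrases in parentheses, as in the example strings in this post."""
    return f"({term})" if " " in term else term


def build_query(or_terms, and_pairs):
    """Join anchor words and AND pairs into one string connected with OR."""
    anchors = [wrap_phrase(t) for t in or_terms]
    pairs = [f"({wrap_phrase(a)} AND {wrap_phrase(b)})" for a, b in and_pairs]
    return " OR ".join(anchors + pairs)


# Sample lists (illustrative only, not a complete keyword pool)
or_terms = ["#HeatherHeyer", "cvilleaug12", "DeAndre Harris"]
and_pairs = [("KKK", "cville"), ("Charlottesville", "white supremacists")]

query = build_query(or_terms, and_pairs)
print(query)
```

Keeping the two lists separate until the last step makes it easier to add or drop a keyword later without retyping the whole string.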

Add NOT words if needed (make sure to test your words)
You can easily capture irrelevant data if any of your search terms refer to people, places, or events other than the ones you want to capture (such as a city that has the same name as the last name of a key person). You can test your search terms by individually looking up each OR word and each AND pairing in Advanced Twitter Search to see if irrelevant results pop up, especially when you do not limit the search to your event dates.

Finally, attach your NOT words to the combined OR and AND string with “NOT.” You can use parentheses for complex search strings (see my example below).
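As a rough sketch of the NOT step, the snippet below wraps an existing search string in parentheses and appends the exclusion words. The function name and the exclusion terms are hypothetical placeholders, and some tools write exclusions differently (for example, with a minus sign), so check the syntax your data provider expects.

```python
def add_exclusions(query, not_terms):
    """Wrap the inclusion string in parentheses and append NOT terms."""
    if not not_terms:
        return query  # nothing to exclude
    excluded = " OR ".join(f"({t})" if " " in t else t for t in not_terms)
    return f"({query}) NOT ({excluded})"


base = "HeatherHeyer OR (KKK AND cville)"
# Placeholder exclusion words; replace them with terms your own testing turned up
print(add_exclusions(base, ["othercity", "unrelated phrase"]))
```

The outer parentheses matter: without them, the NOT could end up applying only to the last term in the string.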

Example
My Charlottesville data pull from GNIP resulted in so many tweets that we had to break up the request into three pulls. You will see that I have some

Pull 1: (Charlottesville OR cville OR VA OR Virginia OR McAuliffe OR @CvilleCityHall OR @VSPPIO) AND (antifa OR Nazis OR Nazi OR neo-Nazi OR Nazi/KKK OR KKK OR (white supremacy) OR (white supremacists) OR (white activists) OR (white activist) OR (James Alex Fields))
Timeline: May 7 to Oct. 12, number of tweets: 3 million approximately
 
Pull 2: (Charlottesville OR cville OR VA OR Virginia OR McAuliffe OR @CvilleCityHall OR @VSPPIO) AND (antifa OR Nazis OR Nazi OR neo-Nazi OR Nazi/KKK OR KKK OR (white supremacy) OR (white supremacists) OR (white activists) OR (white activist) OR (James Alex Fields))
Timeline: Feb. 7 to May 7, number of tweets: 40,000
 
Pull 3: (Charlottesville OR cville OR VA OR Virginia OR McAuliffe OR @CvilleCityHall OR @VSPPIO) AND (statue OR memorial OR (Robert E Lee) OR (Lee Park) OR (General Lee) OR Confederate OR (Emancipation Park) OR (Stonewall Jackson) OR protest OR march OR marchers) OR cvilleaug12 OR #invisiblecville OR #HeatherHeyer OR #DeAndreHarris OR (DeAndre Harris) OR #unitycville OR #defendcville OR #cvillestrong OR #standwithcharlottesville
Timeline: Feb. 7 to Oct. 12, number of tweets: 2.8 million approximately
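
Strings this long are easy to mistype, and a missing parenthesis can change what a search captures. Here is a minimal sanity check, assuming your string uses only round parentheses for grouping; the function name is hypothetical, and the check does not verify the AND/OR logic itself.

```python
def parentheses_balanced(query):
    """Return True if every opening parenthesis has a matching close."""
    depth = 0
    for ch in query:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close appeared before its opening parenthesis
                return False
    return depth == 0


good = "(Charlottesville OR cville) AND (KKK OR (white supremacy))"
bad = "(Charlottesville OR cville AND (KKK OR (white supremacy))"
print(parentheses_balanced(good), parentheses_balanced(bad))
```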

Remember to filter out bots
You’re not out of the woods yet. Once you have your data, make sure to use a method for filtering out bots if you’re doing any theory-building about people’s behavior. There were enough political bots in the Charlottesville data to affect our topic modeling, so this is a fundamental step. Look for identical tweets, nearly identical tweets (because bots can swap out adjectives to try to evade capture), and tweets that tag a bunch of people with the same link.
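
The three patterns above can be screened for with a short Python sketch. This is not the method we used on the Charlottesville data; it is a simple illustration that uses only the standard library. The function names, the 0.9 similarity threshold, the three-mention minimum, and the sample tweets are all assumptions, and a flagged tweet is only a candidate for review, not proof of a bot.

```python
import re
from collections import Counter, defaultdict
from difflib import SequenceMatcher

MENTION = re.compile(r"@\w+")
URL = re.compile(r"https?://\S+")


def normalize(text):
    """Lowercase, drop mentions and links, and collapse whitespace."""
    text = URL.sub("", MENTION.sub("", text.lower()))
    return " ".join(text.split())


def flag_suspects(tweets, near_threshold=0.9, min_mentions=3):
    """Return indexes of tweets that match any of the three patterns."""
    suspects = set()
    norm = [normalize(t) for t in tweets]

    # 1. Identical tweets (compared after normalization)
    counts = Counter(n for n in norm if n)
    suspects.update(i for i, n in enumerate(norm) if n and counts[n] > 1)

    # 2. Nearly identical tweets; pairwise comparison is slow, so use it on samples
    for i in range(len(norm)):
        for j in range(i + 1, len(norm)):
            if norm[i] and norm[j]:
                ratio = SequenceMatcher(None, norm[i], norm[j]).ratio()
                if ratio >= near_threshold:
                    suspects.update((i, j))

    # 3. Tweets that tag several accounts and share the same link
    by_link = defaultdict(list)
    for i, t in enumerate(tweets):
        if len(MENTION.findall(t)) >= min_mentions:
            for link in URL.findall(t):
                by_link[link].append(i)
    for indexes in by_link.values():
        if len(indexes) > 1:
            suspects.update(indexes)

    return sorted(suspects)


# Made-up sample tweets for illustration
tweets = [
    "The rally was a terrible event http://example.com/a",
    "The rally was a horrible event http://example.com/a",
    "I marched downtown with my neighbors today",
    "@a @b @c read this http://example.com/x",
    "@d @e @f look here http://example.com/x",
    "I marched downtown with my neighbors today",
    "Thinking of everyone in Charlottesville tonight",
]
print(flag_suspects(tweets))
```

Because the second check compares every pair of tweets, it will not scale to millions of tweets as written; for a full data pull you would need a faster approach to near-duplicate detection.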

Final thoughts for now
One reason I love being at UNC Charlotte is the access to big data and the institutional support for collaborating on interdisciplinary projects! My thanks go to Ryan Wesslen for training me (he is an incredible teacher of more advanced topics, as well). If you have additional tips for search strings or can improve my post, feel free to leave your feedback in the comments area.

 

Posted in research | Tagged academic research, big data, boolean, data pull, search terms, Twitter |

  • Tiffany Gallicano, Ph.D.

    I teach PR classes and conduct research at UNC Charlotte. I only post on occasion to this blog due to other priorities.