About the data

This exam is motivated by the blog post by Peter Ellis on polls leading up to the Australian Federal election, and the most recent blog post from election day. A copy of the data can be downloaded from or read directly from here. Download and read the data into your R session.


  1. (1pt) What were the earliest and latest dates of polls conducted in the data provided?
 Min.   :2007-11-20  
 1st Qu.:2012-01-17  
 Median :2013-09-03  
 Mean   :2014-03-15  
 3rd Qu.:2016-06-02  
 Max.   :2019-05-15  
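
A minimal sketch of how the earliest and latest dates could be pulled out directly, rather than read off a `summary()`. It assumes the data has been read into a data frame called `polls` with a `Date` column named `start_date`. The column name appears in later output, but whether the summary above was computed on it is an assumption. The three rows below are a toy stand-in for the real file.

```r
library(dplyr)

# Toy stand-in for the real poll data (assumed column names).
polls <- tibble(
  start_date = as.Date(c("2007-11-20", "2010-08-13", "2019-05-15")),
  firm       = c("Newspoll", "Essential", "Ipsos")
)

# Earliest and latest poll dates, returned as a one-row tibble.
date_range <- polls %>%
  summarise(earliest = min(start_date), latest = max(start_date))

date_range
```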
  1. (1pt) How many different firms have conducted polls in this data?
# A tibble: 12 x 2
   firm                n
   <chr>           <int>
 1 Essential        1922
 2 Newspoll         1562
 3 Roy Morgan       1210
 4 ReachTEL          323
 5 Nielsen           234
 6 Ipsos             224
 7 Galaxy            192
 8 YouGov             63
 9 Election result    25
10 AMR                18
11 Lonergan            6
12 YouGov/Galaxy       6
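
The table lists 12 distinct values of `firm`, but one of them is "Election result", which leaves 11 polling firms. A sketch of how both the table and the count might be computed follows. It assumes a data frame `polls` with a `firm` column, and the rows here are a toy stand-in rather than the real data.

```r
library(dplyr)

# Toy stand-in for the real poll data (assumed column name `firm`).
polls <- tibble(
  firm = c("Newspoll", "Newspoll", "Essential", "Election result", "Ipsos")
)

# Rows per firm, most frequent first (the shape of the table above).
firm_counts <- polls %>% count(firm, sort = TRUE)

# Number of distinct pollsters, leaving out the actual election results.
n_pollsters <- polls %>%
  filter(firm != "Election result") %>%
  summarise(n = n_distinct(firm)) %>%
  pull(n)

firm_counts
n_pollsters
```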
  1. (3pts) Use your internet search skills. Who are these pollsters? What organisations own them? How does each organisation collect its data? Write a paragraph explaining what you have managed to find, and what you couldn’t find. (Focus your attention on the firms that poll frequently.)

Newspoll is associated with The Australian newspaper, which is owned by the Murdoch media empire. However, according to Wikipedia (https://en.wikipedia.org/wiki/Newspoll), Newspoll is administered by Galaxy, which is owned by the international market research and data analytics group YouGov. The latest polling information is displayed at http://www.newspoll.com.au, but the site does not give details on how the data is collected.

Essential is associated with the Guardian newspaper. They maintain a panel of 100,000 members and draw about 1000 of them for interviews each week. They aim for a 50/50 male/female split among people aged over 18. Data is sourced from Your Source, another company.

Ipsos is a specialist polling organisation with no apparent affiliation with news organisations or political parties. In the most recent poll, they sampled 1,842 people, using random digit dialing of mobile phone numbers.

Roy Morgan is an Australian market research company. It is independent, and the company now operates globally. Their most recent polling data was collected by asking respondents “Regardless of who you have or will vote for who do you THINK will win the Federal Election?” Data was collected from 3,004 voters, by SMS.

  1. (2pts) Is the data in tidy form? Explain your answer.

Yes. The data is in long form: each row holds one measured value, intended_vote, identified by several variables (dates, firm, preference type, and party).

  1. (2pts) Have all of the polling firms been operating for the same time period?
# A tibble: 12 x 3
   firm            first      last      
   <chr>           <date>     <date>    
 1 Newspoll        2007-11-20 2019-05-16
 2 Election result 2007-11-24 2016-07-02
 3 Roy Morgan      2010-07-17 2019-05-12
 4 Essential       2010-08-13 2019-05-14
 5 Nielsen         2010-08-21 2014-05-17
 6 Galaxy          2011-08-03 2019-04-25
 7 AMR             2013-03-22 2013-08-18
 8 ReachTEL        2013-05-02 2018-08-06
 9 Ipsos           2014-10-30 2019-05-15
10 Lonergan        2016-05-06 2016-05-08
11 YouGov          2017-06-22 2017-12-10
12 YouGov/Galaxy   2019-05-13 2019-05-15

No. The pollsters' operating periods differ considerably. The main ones (Newspoll, Roy Morgan and Essential) have polled consistently for roughly a decade. Others have appeared and disappeared within months, e.g. AMR, and Nielsen, which was a major operator, stopped conducting polls in 2014.
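
A table like the one above can be sketched by grouping on firm and taking the earliest and latest poll date. The data frame name `polls` and the use of `start_date` are assumptions. The toy rows reuse a few of the first and last dates shown in the table.

```r
library(dplyr)

# Toy stand-in: two polls per firm, using dates taken from the table above.
polls <- tibble(
  firm = c("Newspoll", "Newspoll", "AMR", "AMR", "Nielsen", "Nielsen"),
  start_date = as.Date(c("2007-11-20", "2019-05-16",
                         "2013-03-22", "2013-08-18",
                         "2010-08-21", "2014-05-17"))
)

# First and last poll date per firm, ordered by when each firm started.
periods <- polls %>%
  group_by(firm) %>%
  summarise(first = min(start_date), last = max(start_date)) %>%
  arrange(first)

periods
```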

  1. (3pts) Are the pollsters all reporting similar numbers? Compute the five number summary (min, q1, median, q3, max) of Lib/Nat intended_vote, separately for each pollster, and sort from highest to lowest median value. (Be sure to drop the actual election results.) Write a few sentences explaining what you learn, particularly focusing on the initial question which relates to pollster bias.

The median intended vote varies among pollsters, with Nielsen generally reporting much more favourable results for Lib/Nat, and Ipsos the least. This suggests that the pollsters may be biased towards one political party or another, or that their collection methods sample different types of people.
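
A minimal sketch of the five number summary by pollster. The column names `firm`, `party` and `intended_vote` are assumed from the description of the data. The pollster names and vote values below are made up purely to exercise the code and are not results from the real data.

```r
library(dplyr)

# Made-up values for illustration only; not taken from the real polls.
polls <- tibble(
  firm = rep(c("Pollster A", "Pollster B", "Election result"), each = 5),
  party = "Lib/Nat",
  intended_vote = c(48, 49, 50, 51, 52,
                    46, 47, 48, 49, 50,
                    47, 50, 50, 53, 50)
)

five_num <- polls %>%
  # Keep Lib/Nat rows and drop the actual election results.
  filter(party == "Lib/Nat", firm != "Election result") %>%
  group_by(firm) %>%
  summarise(
    vote_min    = min(intended_vote),
    vote_q1     = quantile(intended_vote, 0.25, names = FALSE),
    vote_median = median(intended_vote),
    vote_q3     = quantile(intended_vote, 0.75, names = FALSE),
    vote_max    = max(intended_vote)
  ) %>%
  # Highest median first.
  arrange(desc(vote_median))

five_num
```

If the real data mixes first preference and two-party-preferred figures, an extra filter on the preference type column would be needed before summarising.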

  1. (3pts) Using the actual election results, what was the vote recorded for Lib/Nat at each of the elections in 2007, 2010, 2013, and 2016? What is the average of these numbers? If the polls were accurately reflecting the actual vote across all these years, what would be the expected average for each pollster? Using these numbers, refine your explanation from the previous question in relation to pollster bias.
# A tibble: 4 x 2
  start_date     m
* <date>     <dbl>
1 2007-11-24  47.3
2 2010-08-21  49.9
3 2013-09-07  53.5
4 2016-07-02  50.4
# A tibble: 1 x 1
1      50.3

The Lib/Nat vote has varied from election to election. In 2007 they lost to the ALP. In 2010 they fell just short of 50% and the ALP formed a minority government. They won government in 2013 and 2016. The average across the four elections is 50.275%. If the polls accurately reflected the vote across these years, each pollster's results would be expected to centre on roughly this value. That is clearly not the case: many pollsters have medians above or below it.
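
The averaging and comparison step can be sketched as follows. The election figures are the ones in the table above. The pollster names and medians in `medians` are hypothetical placeholders for the summary from the previous question.

```r
library(dplyr)

# Lib/Nat election results as shown in the table above.
results <- tibble(
  start_date = as.Date(c("2007-11-24", "2010-08-21", "2013-09-07", "2016-07-02")),
  m = c(47.3, 49.9, 53.5, 50.4)
)

# Average actual vote across the four elections.
benchmark <- mean(results$m)

# Hypothetical pollster medians, only to show the comparison.
medians <- tibble(
  firm = c("Pollster A", "Pollster B"),
  vote_median = c(51.5, 49.0)
)

# Positive values mean the pollster sits above the election average.
comparison <- medians %>%
  mutate(diff_from_benchmark = vote_median - benchmark)

comparison
```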

  1. (3pts) Make a plot of intended vote (two party preferred) for Lib/Nat by time of poll. Add a loess smoother that will allow the reader to look at the rough average of the polls, and hence see how the voting public are trending over time. Overlay the actual election results (as points). Include a baseline at 50% that will show the critical juncture when the outcome would likely be a change in government. Coming into the election last Saturday (18/5/2019), what did it look like the result would be?
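
One possible layout for this plot, as a sketch using ggplot2. The poll values are simulated with `rnorm()` so the example runs on its own and say nothing about the real polls. The single election point is the 2016 result from the table above. `polls`, `elections` and the column names are assumptions.

```r
library(dplyr)
library(ggplot2)

set.seed(1)

# Simulated polls for illustration only; not the real data.
polls <- tibble(
  start_date = seq(as.Date("2016-08-01"), as.Date("2019-05-15"), length.out = 60),
  intended_vote = 49 + rnorm(60, sd = 1)
)

# Actual election result (2016 value from the table above).
elections <- tibble(
  start_date = as.Date("2016-07-02"),
  intended_vote = 50.4
)

p <- ggplot(polls, aes(x = start_date, y = intended_vote)) +
  geom_point(alpha = 0.4) +                                        # individual polls
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +     # rough average
  geom_point(data = elections, colour = "red", size = 3) +         # election results
  geom_hline(yintercept = 50, linetype = "dashed") +               # 50% baseline
  labs(x = "Poll date", y = "Lib/Nat two-party-preferred vote (%)")

p
```

With the real data, `polls` would first be filtered to Lib/Nat two-party-preferred rows, with the "Election result" rows split off into `elections`.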