2023-04-25 - Scraping
main topics
Parsing files in the following formats:
- CSV
require 'csv'
- JSON
require 'json'
- XML (and, by extension, HTML)
require 'nokogiri'
- ruby nokogiri
CSV
reading from a CSV file
require 'csv'
file = 'beatles.csv'
# "First Name","Last Name","Instrument"
# "John","Lennon","Guitar"
# "Paul","McCartney","Bass Guitar"
# "George","Harrison","Lead Guitar"
# "Ringo","Starr","Drums"
CSV.foreach(file, 'r', headers: :first_row) do |row|
  full_name = "#{row['First Name']} #{row['Last Name']}"
  instrument = row['Instrument']
  puts "#{full_name} plays #{instrument}"
end
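To try the same header-based access without a file on disk, here is a minimal sketch that parses an in-memory string with CSV.parse. The names `csv_text` and `lines` are made up for this example, and the data is a shortened copy of the file above.

```ruby
require 'csv'

# Assumption: a shortened copy of beatles.csv, inlined for illustration
csv_text = <<~CSV_DATA
  "First Name","Last Name","Instrument"
  "John","Lennon","Guitar"
  "Ringo","Starr","Drums"
CSV_DATA

# headers: true treats the first line as the header row
table = CSV.parse(csv_text, headers: true)

lines = table.map do |row|
  # each row is a CSV::Row, so fields can be read by header name
  "#{row['First Name']} #{row['Last Name']} plays #{row['Instrument']}"
end

puts lines
```

CSV.foreach is probably still the better fit for large files, since it reads one row at a time instead of loading everything first.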
storing in a CSV file
require 'csv'
beatles = [
  ["John", "Lennon", "Guitar"],
  ["Paul", "McCartney", "Bass Guitar"],
  ["George", "Harrison", "Lead Guitar"],
  ["Ringo", "Starr", "Drums"]
]
file = 'beatles.csv'
CSV.open(file, 'wb') do |csv|
  # NOTE: each row is an Array; every value ends up stored as a String
  csv << ['First Name', 'Last Name', 'Instrument']
  beatles.each { |beatle| csv << beatle }
end
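A quick way to see what actually gets written is to build the CSV text in memory with CSV.generate and read it back. This is only a sketch: the `rows` data (including the `Born` column) is made up, and it includes an Integer and a nil on purpose to show that values come back as Strings (or nil for an empty field).

```ruby
require 'csv'

# Assumption: mixed-type rows, made up for illustration
rows = [
  ['John', 'Lennon', 1940],
  ['Ringo', 'Starr', nil]
]

# CSV.generate builds the CSV text as a String instead of writing a file
csv_text = CSV.generate do |csv|
  csv << ['First Name', 'Last Name', 'Born']
  rows.each { |row| csv << row }
end

puts csv_text

# Reading it back: the Integer is now a String, the nil is an empty field
parsed = CSV.parse(csv_text, headers: true)
puts parsed[0]['Born'].class
puts parsed[1]['Born'].inspect
```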
JSON
parsing a JSON string
Just use JSON.parse(json_payload). Example:
require 'json'
file = 'beatles.json'
# put file contents in a string
json_payload = File.read(file)
# parse JSON contents into a Ruby object
beatles = JSON.parse(json_payload)
# NOTES:
# 1. a JS Object becomes a Ruby Hash
# 2. a JS Array becomes a Ruby Array
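To make the notes above concrete, here is a small sketch of reading values out of the parsed result. The payload is made up for this example. By default the keys are Strings; passing `symbolize_names: true` gives Symbol keys instead.

```ruby
require 'json'

# Assumption: a small payload made up for illustration
json_payload = '{"title": "The Beatles", "beatles": [{"first_name": "John", "instrument": "Guitar"}]}'

# Default: JSON object keys become String keys in the Hash
data = JSON.parse(json_payload)
puts data['title']
puts data['beatles'].first['first_name']

# symbolize_names: true converts the keys to Symbols
sym = JSON.parse(json_payload, symbolize_names: true)
puts sym[:beatles].first[:instrument]
```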
converting a Ruby Hash into a JSON string
Just use JSON.generate(my_hash). Example:
require 'json'
# here's the Hash
# NOTE: in Ruby, the "key": syntax creates Symbol keys (:title, :beatles, ...)
beatles = {
  "title": "The Beatles",
  "beatles": [
    {
      "first_name": "John",
      "last_name": "Lennon",
      "instrument": "Guitar"
    },
    {
      "first_name": "Paul",
      "last_name": "McCartney",
      "instrument": "Bass Guitar"
    },
    {
      "first_name": "George",
      "last_name": "Harrison",
      "instrument": "Lead Guitar"
    },
    {
      "first_name": "Ringo",
      "last_name": "Starr",
      "instrument": "Drums"
    }
  ]
}
file = 'beatles.json'
# use a different block variable so it doesn't shadow the outer `file`
File.open(file, 'wb') do |f|
  json_payload = JSON.generate(beatles)
  f.write(json_payload)
end
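JSON.generate writes everything on a single line. For output that is easier to read, JSON.pretty_generate is an alternative. The sketch below uses a trimmed-down Hash (`band`, a made-up name) and also shows that Symbol keys come back as Strings after a round trip.

```ruby
require 'json'

# Assumption: a trimmed-down Hash with Symbol keys, made up for illustration
band = { title: 'The Beatles', members: [{ first_name: 'Ringo', instrument: 'Drums' }] }

# compact, single-line output
compact = JSON.generate(band)
puts compact

# indented, human-readable output
puts JSON.pretty_generate(band)

# Round trip: Symbol keys are read back as String keys
restored = JSON.parse(compact)
puts restored.keys.inspect
```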
XML / HTML
Note: requires installing the nokogiri gem. See https://nokogiri.org/.
Basically this:
require 'open-uri' # needed for URI.open
require 'nokogiri'
def get_html_doc(url)
  html = URI.open(url).read
  Nokogiri::HTML.parse(html)
end
doc = get_html_doc('https://meleu.sh')
# then you can use #search with a CSS Selector
puts doc.search('h1').text
See also this code to list the top movies from IMDB: top_movies.rb